Abstract

The problem considered is that of improving the accuracy of an hypothesis output
by a learning algorithm in the distribution-free (PAC) learning model. A concept class
is learnable (or strongly learnable) if, given access to a source of examples from the
unknown concept, the learner with high probability is able to output an hypothesis
that is correct on all but an arbitrarily small fraction of the instances. The concept
class is weakly learnable if the learner can produce an hypothesis that performs only
slightly better than random guessing. In this paper, it is shown that these two notions
of learnability are equivalent.

A method is described for converting a weak learning algorithm into one that
achieves arbitrarily high accuracy. This construction may have practical applications
as a tool for efficiently converting a mediocre learning algorithm into one that
performs extremely well. In addition, the construction has some interesting theoretical
consequences, including a set of general upper bounds on the complexity of any strong
learning algorithm as a function of the allowed error $\epsilon$.

Keywords:  Machine  learning,  learning  from  examples,  polynomial-time  identification. 


1  Introduction 

Since Valiant’s pioneering paper [23], interest has flourished in the so-called
distribution-free or probably approximately correct (PAC) model of learning. In this
model, the learner tries to identify an unknown concept based on randomly chosen
examples of the concept. Examples are chosen according to a fixed but unknown and
arbitrary distribution on the space of instances. The learner’s task is to find an
hypothesis or prediction rule of his own that correctly classifies new instances as
positive or negative examples of the concept. With high probability, the hypothesis
must be correct for all but an arbitrarily small fraction of the instances.

This paper was prepared with support from ARO Grant DAAL03-86-K-0171, DARPA Contract
N00014-89-J-1988, and a grant from the Siemens Corporation.

Often, the inference task includes a requirement that the output hypothesis be of a
specified form. In this paper, however, we will instead be concerned with a
representation-independent model of learning in which the learner may output any
hypothesis that classifies instances in polynomial time.

A  class  of  concepts  is  learnable  (or  strongly  learnable)  if  there  exists  a  polynomial-time 
algorithm  that  achieves  low  error  with  high  confidence  for  all  concepts  in  the  class.  A 
weaker  model  of  learnability,  called  weak  learnability,  drops  the  requirement  that  the  learner 
be  able  to  achieve  arbitrarily  high  accuracy;  a  weak  learning  algorithm  need  only  output  an 
hypothesis  that  performs  slightly  better  (by  an  inverse  polynomial)  than  random  guessing. 
The  notion  of  weak  learnability  was  introduced  by  Kearns  and  Valiant  [17,  18]  who  left  open 
the  question  of  whether  the  notions  of  strong  and  weak  learning  are  equivalent.  This  question 
was  termed  the  hypothesis  boosting  problem  since  showing  the  notions  are  equivalent  requires 
a  method  for  boosting  the  low  accuracy  of  a  weak  learning  algorithm’s  hypotheses. 

Kearns  [15],  considering  the  hypothesis  boosting  problem,  gives  a  convincing  argument 
discrediting  the  natural  approach  of  trying  to  boost  the  accuracy  of  a  weak  learning  algorithm 
by  running  the  procedure  many  times  and  taking  “majority  vote”  of  the  output  hypotheses. 
Kearns  and  Valiant  [14,  17]  show  that,  under  a  uniform  distribution  on  the  instance  space, 
monotone  Boolean  functions  are  weakly,  but  not  strongly,  learnable.  This  implies  that  strong 
and  weak  learnability  are  not  equivalent  when  certain  restrictions  are  placed  on  the  instance 
space  distribution.  Thus,  it  did  not  seem  implausible  that  the  strong  and  weak  learning 
models  would  prove  to  be  inequivalent  for  unrestricted  distributions  as  well. 

Nevertheless, in this paper, the hypothesis boosting question is answered in the
affirmative. The main result is a proof of the perhaps surprising equivalence of
strong and weak learnability.

This result may have significant applications as a tool for proving that a concept class
is learnable since, in the future, it will suffice to find an algorithm correct on only, say, 51%
of the instances (for all distributions). Alternatively, in its negative contrapositive form, the
result says that, if a concept class cannot be learned with accuracy 99.9%, then we cannot
hope to do even slightly better than guessing on the class (for some distribution).

The proof presented here is constructive; an explicit method is described for directly
converting a weak learning algorithm into one that achieves arbitrary accuracy. The
construction uses filtering to modify the distribution of examples in such a way as to
force the weak learning algorithm to focus on the harder-to-learn parts of the
distribution. Thus, the distribution-free nature of the learning model is fully exploited.

An immediate corollary of the main result is the equivalence of strong and group
learnability. A group-learning algorithm need only output an hypothesis capable of
classifying large groups of instances, all of which are either positive or negative. The
notion of group learnability was considered by Kearns et al. [16], and was shown to be
equivalent to weak learnability by Kearns and Valiant [14, 17]. The result also extends
those of Haussler et al. [10] that prove the equivalence of numerous variations and
relaxations on the basic PAC-learning model; both weak and group learnability are added
to this general class of equivalent learning models. The relevance of the main result to
a number of other learning models is also considered in this paper.

An interesting and unexpected consequence of the construction is a proof that any strong
learning algorithm outputting hypotheses whose length (and thus whose time to evaluate)
depends on the allowed error $\epsilon$ can be modified to output hypotheses of length only
polynomial in $\log(1/\epsilon)$. Thus, any learning algorithm can be converted into one
whose output hypotheses do not become significantly more complex as the error tolerance
is lowered.

This bound on the size of the output hypothesis implies the hardness of learning any
concept class not evaluatable by a family of small circuits. For example, this shows that
pattern languages (a class of languages considered previously by Angluin [1] and others)
are unlearnable assuming only that $\mathrm{NP/poly} \ne \mathrm{P/poly}$. This is the first
representation-independent hardness result not based on cryptographic assumptions. The
bound also shows that, for any function not computable by polynomial-size circuits, there
exists a distribution on the function’s domain over which the function cannot be even
roughly approximated by a family of small circuits.

In addition to the bound on hypothesis size, the construction implies a set of general
bounds on the dependence on $\epsilon$ of the time, sample and space complexity needed to
efficiently learn any learnable concept class. Most surprising is a proof that there exists
for every learnable concept class an efficient algorithm requiring space only
poly-logarithmic in $1/\epsilon$. Because the size of the sample needed to learn with this
accuracy is in general $\Omega(1/\epsilon)$, this means, for example, that far less space is
required to learn than would be necessary to store the entire sample. Since most of the
known learning algorithms work in exactly this manner (i.e., by storing a large sample and
finding an hypothesis consistent with it), this implies a dramatic savings of memory for a
whole class of algorithms (though possibly at the cost of requiring a larger sample).

Such  general  complexity  bounds  have  implications  for  the  on-line  learning  model  as  well. 
In  this  model,  the  learner  is  presented  one  instance  at  a  time  in  a  series  of  trials.  As  each  is 
received,  the  learner  tries  to  predict  the  true  classification  of  the  new  instance,  attempting 
to  minimize  the  number  of  mistakes,  or  prediction  errors. 

Translating the bounds described above into the on-line model, it is shown that, for
every learnable concept class, there exists an on-line algorithm whose space requirements
are quite modest in comparison to the number of examples seen so far. In particular, the
space needed on the first $m$ trials is only poly-logarithmic in $m$. Such space-efficient
on-line algorithms are of particular interest because they capture the notion of an
incremental algorithm forced by its limited memory to explicitly generalize or abstract
from the data observed. Also, these results on the space-efficiency of batch and on-line
algorithms extend the work of others interested in this problem, including Boucheron and
Sallantin [6], Floyd [8], and Haussler [9]. In particular, these results solve an open
problem proposed by Haussler, Littlestone and Warmuth [11].

An interesting bound is also derived on the expected number of mistakes made on the
first $m$ trials. It is shown that, if a concept class is learnable, then there exists an
on-line algorithm for the class for which this expectation is bounded by a polynomial in
$\log m$. Thus, for large $m$, we expect an extremely small fraction of the first $m$
predictions to be incorrect. This result answers another open question given by Haussler,
Littlestone and Warmuth [11], and significantly improves a similar bound given in their
paper (as well as their paper with Kearns [10]) of $m^\alpha$ for some constant $\alpha < 1$.

2  Preliminaries 

We begin with a description of the distribution-free learning model. A concept $c$ is a
Boolean function on some domain of instances. A concept class $C$ is a collection of
concepts. Often, $C$ is decomposed into subclasses $C_n$ indexed by a parameter $n$. That
is, $C = \bigcup_{n \ge 1} C_n$, and all the concepts in $C_n$ have a common domain $X_n$.
We assume each instance in $X_n$ has encoded length bounded by a polynomial in $n$, and we
let $X = \bigcup_{n \ge 1} X_n$. Also, we associate with each concept $c$ its size $s$,
typically a measure of the length of $c$ under some encoding scheme on the concepts in $C$.

The learner is assumed to have access to a source EX of examples. Each time EX is
called, one instance is randomly and independently chosen from $X_n$ according to some fixed
but unknown and arbitrary distribution $D$. The oracle returns the chosen instance $v$, along
with a label indicating the value $c(v)$ of the instance under the unknown target concept
$c \in C_n$. Such a labeled instance is called an example. We assume EX runs in unit time.

Given access to EX, the learning algorithm runs for a time and finally outputs an
hypothesis $h$, a prediction rule on $X_n$. In this paper, we make no restrictions on $h$
other than that there exist a (possibly probabilistic) polynomial time algorithm that,
given $h$ and an instance $v$, computes $h(v)$, $h$'s prediction on $v$.

We write $\Pr_{v \in D}[\pi(v)]$ to indicate the probability of predicate $\pi$ holding on
instances $v$ drawn from $X_n$ according to distribution $D$. To accommodate probabilistic
hypotheses, we will find it useful to regard $\pi(v)$ as a Bernoulli random variable. For
example, $\Pr[h(v) \ne c(v)]$ is the chance that hypothesis $h$ (which may be randomized)
will misclassify some particular instance $v$. In contrast, the quantity
$\Pr_{v \in D}[h(v) \ne c(v)]$ is the probability that $h$ will misclassify an instance $v$
chosen at random according to distribution $D$. Note that this last probability is taken
over both the random choice of $v$, and any random bits used by $h$.

In general, assuming independence, we have

$$\Pr_{v \in D}[\pi(v)] = \sum_{v \in X_n} D(v) \Pr[\pi(v)]$$

where $D(v)$ is the probability of instance $v$ being chosen under $D$. (Technically, this
formula is valid only when $X_n$ is discrete. If $X_n$ is continuous, then the summation
would need to be replaced by the appropriate integral, and $D$ by a probability density
function. To simplify the presentation, we will assume that $X_n$ is discrete, and omit the
extension of these results to continuous domains.)
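As a small illustration of the formula above, the following sketch computes the error of a randomized hypothesis over a discrete domain. The three-instance domain, the distribution `D`, and the per-instance misclassification probabilities are hypothetical values chosen only for this example.

```python
from fractions import Fraction

def error_under(D, mis_prob):
    """Return the sum over v of D(v) * Pr[h(v) != c(v)].

    D        -- dict mapping each instance v to its probability D(v)
    mis_prob -- dict mapping v to Pr[h(v) != c(v)], taken over h's own coins
    """
    assert sum(D.values()) == 1, "D must be a probability distribution"
    return sum(D[v] * mis_prob[v] for v in D)

# Hypothetical three-instance domain.
D = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
# h is always right on "a", always wrong on "b", and flips a fair coin on "c".
mis_prob = {"a": Fraction(0), "b": Fraction(1), "c": Fraction(1, 2)}

err = error_under(D, mis_prob)
print("error:", err)
print("accuracy:", 1 - err)
```

Exact fractions are used so that the sum can be checked without rounding.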

The probability $\Pr_{v \in D}[h(v) \ne c(v)]$ is called the error of $h$ on $c$ under $D$;
if the error is no more than $\epsilon$, then we say $h$ is $\epsilon$-close to the target
concept $c$ under $D$. The quantity $\Pr_{v \in D}[h(v) = c(v)]$ is the accuracy of $h$ on
$c$ under $D$.


We say that a concept class $C$ is learnable, or strongly learnable, if there exists an
algorithm $A$ such that for all $n \ge 1$, for all target concepts $c \in C_n$, for all
distributions $D$ on $X_n$, and for all $0 < \epsilon, \delta < 1$, algorithm $A$, given
parameters $n$, $\epsilon$, $\delta$, the size $s$ of $c$, and access to oracle EX, runs in
time polynomial in $n$, $s$, $1/\epsilon$ and $1/\delta$, and outputs an hypothesis $h$ that
with probability at least $1 - \delta$ is $\epsilon$-close to $c$ under $D$. There are many
other equivalent notions of learnability, including polynomial predictability [10].

Kearns and Valiant [17, 18] introduced a weaker form of learnability in which the error
$\epsilon$ cannot necessarily be made arbitrarily small. A concept class $C$ is weakly
learnable if there exists a polynomial $p$ and an algorithm $A$ such that for all $n \ge 1$,
for all target concepts $c \in C_n$, for all distributions $D$ on $X_n$, and for all
$0 < \delta < 1$, algorithm $A$, given parameters $n$, $\delta$, the size $s$ of $c$, and
access to oracle EX, runs in time polynomial in $n$, $s$ and $1/\delta$, and outputs an
hypothesis $h$ that with probability at least $1 - \delta$ is $(1/2 - 1/p(n,s))$-close to
$c$ under $D$. In other words, a weak learning algorithm produces a prediction rule that
performs just slightly better than random guessing.

3  The  Equivalence  of  Strong  and  Weak  Learnability 

The  main  result  of  this  paper  is  a  proof  that  learnability  and  weak  learnability  are  equivalent 
notions. 

Theorem  3.1  A  concept  class  C  is  weakly  learnable  if  and  only  if  it  is  learnable. 

That strong learnability implies weak learnability is trivial. The remainder of this section
is devoted to a proof of the converse. We assume then that some concept class $C$ is weakly
learnable and show how to build a strong learning algorithm around a weak one.

We  begin  with  a  description  of  a  technique  by  which  the  accuracy  of  any  algorithm  can 
be  boosted  by  a  small  but  significant  amount.  Later,  we  will  show  how  this  mechanism  can 
be  applied  recursively  to  make  the  error  arbitrarily  small. 

3.1  The  Hypothesis  Boosting  Mechanism 

Let $A$ be an algorithm that produces with high probability an hypothesis $\alpha$-close to
the target concept $c$. We sketch an algorithm $A'$ that simulates $A$ on three different
distributions, and outputs an hypothesis significantly closer to $c$.

Let EX be the given examples oracle, and let $D$ be the distribution on $X_n$ induced by
EX. The algorithm $A'$ begins by simulating $A$ on the original distribution $D_1 = D$, using
the given oracle $\mathrm{EX}_1 = \mathrm{EX}$. Let $h_1$ be the hypothesis output by $A$.

Intuitively, $A$ has found some weak advantage on the original distribution; this advantage
is expressed by $h_1$. To force $A$ to learn more about the “harder” parts of the
distribution, we must somehow destroy this advantage. To do so, $A'$ creates a new
distribution $D_2$ under which an instance chosen according to $D_2$ has a roughly equal
chance of being correctly or incorrectly classified by $h_1$. The distribution $D_2$ is
simulated by filtering the examples chosen according to $D$ by EX. To simulate $D_2$, a new
examples oracle $\mathrm{EX}_2$ is constructed. When asked for an instance, $\mathrm{EX}_2$
first flips a fair coin: if the result is “heads,” then $\mathrm{EX}_2$ requests examples
from EX until one is chosen for which $h_1(v) = c(v)$; otherwise, $\mathrm{EX}_2$ waits for
an instance to be chosen for which $h_1(v) \ne c(v)$. (Later we show how to prevent
$\mathrm{EX}_2$ from having to wait too long in either of these loops for a desired
instance.) The algorithm $A$ is again simulated, this time providing $A$ with examples
chosen by $\mathrm{EX}_2$ according to $D_2$. Let $h_2$ be the output hypothesis.
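The filtering oracle just described can be sketched as a closure over the original oracle. This is only an illustration under stated assumptions: the names `make_ex2`, `ex`, `h1` and `rng` are hypothetical, and the toy domain (integers 0 to 9 under the uniform distribution, with a threshold target) is invented for the example.

```python
import random

def make_ex2(ex, h1, rng):
    """Build a filtered oracle EX2 from oracle `ex` and hypothesis `h1`.

    ex  -- zero-argument function returning a labeled example (v, label)
    h1  -- function mapping an instance v to a predicted label
    rng -- random.Random instance supplying the fair coin
    """
    def ex2():
        want_correct = rng.random() < 0.5      # "heads": wait for h1(v) == c(v)
        while True:                            # can loop long if h1 is nearly perfect
            v, label = ex()
            if (h1(v) == label) == want_correct:
                return v, label
    return ex2

# Toy setting (hypothetical): instances 0..9 drawn uniformly, target c(v) = (v >= 5),
# and h1(v) = (v >= 3), which errs exactly on v in {3, 4} (error 0.2 under D).
rng = random.Random(0)

def ex():
    v = rng.randrange(10)
    return v, v >= 5

def h1(v):
    return v >= 3

ex2 = make_ex2(ex, h1, rng)
sample = [ex2() for _ in range(20000)]
frac_wrong = sum(h1(v) != label for v, label in sample) / len(sample)
print(frac_wrong)   # fraction of EX2 examples that h1 misclassifies
```

In this toy run, roughly half of the examples returned by the filtered oracle are misclassified by `h1`, even though `h1` errs on only a fifth of the original distribution.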

Finally, $D_3$ is constructed by filtering from $D$ those instances on which $h_1$ and $h_2$
agree. That is, a third oracle $\mathrm{EX}_3$ simulates the choice of an instance according
to $D_3$ by requesting instances from EX until one is found for which $h_1(v) \ne h_2(v)$.
(Again, we will later show how to limit the time spent waiting in this loop for a desired
instance.) For a third time, algorithm $A$ is simulated with examples drawn this time by
$\mathrm{EX}_3$, producing hypothesis $h_3$.
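The third oracle admits an equally short sketch. As before, the helper name `make_ex3` and the toy hypotheses are assumptions made for illustration and do not come from the paper.

```python
import random

def make_ex3(ex, h1, h2):
    """Build an oracle EX3 that returns only examples on which h1 and h2 disagree."""
    def ex3():
        while True:                      # can loop long if h1 and h2 rarely disagree
            v, label = ex()
            if h1(v) != h2(v):
                return v, label
    return ex3

# Toy setting (hypothetical): instances 0..9 drawn uniformly, target c(v) = (v >= 5).
rng = random.Random(1)

def ex():
    v = rng.randrange(10)
    return v, v >= 5

def h1(v):
    return v >= 3

def h2(v):
    return v >= 6

ex3 = make_ex3(ex, h1, h2)
seen = {ex3()[0] for _ in range(2000)}
print(sorted(seen))   # only instances on which h1 and h2 disagree
```

Here the two toy hypotheses disagree exactly on the instances 3, 4 and 5, so those are the only instances the filtered oracle can return.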

At last, $A'$ outputs its hypothesis $h$: given an instance $v$, if $h_1(v) = h_2(v)$ then
$h$ predicts the agreed upon value; otherwise, $h$ predicts $h_3(v)$. (In other words, $h$
takes “majority vote” of $h_1$, $h_2$ and $h_3$.) Later, we show that $h$'s error is bounded
by $g(\alpha) = 3\alpha^2 - 2\alpha^3$. This quantity is significantly smaller than the
original error $\alpha$, as can be seen from its graph depicted in Figure 1. (The solid curve
is the function $g$, and, for comparison, the dotted line shows a graph of the identity
function.)
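A minimal sketch of the combined hypothesis and of the function $g$ follows; the function names are illustrative assumptions. As a side observation not made in the text, $3\alpha^2 - 2\alpha^3 = 3\alpha^2(1-\alpha) + \alpha^3$ is also the probability that at least two of three independent votes, each wrong with probability $\alpha$, are wrong.

```python
def g(alpha):
    """Error bound after one boosting step: 3*alpha^2 - 2*alpha^3."""
    return 3 * alpha**2 - 2 * alpha**3

def majority(h1, h2, h3):
    """Combine three hypotheses as described: h3 breaks ties between h1 and h2."""
    def h(v):
        b1, b2 = h1(v), h2(v)
        return b1 if b1 == b2 else h3(v)
    return h

# Tabulate g against the identity for a few error rates below 1/2.
for alpha in (0.49, 0.4, 0.25, 0.1, 0.01):
    print(f"alpha = {alpha:<5} g(alpha) = {g(alpha):.6f}")
```

The table shows the behaviour described for Figure 1: the improvement is slight for errors near 1/2 and becomes roughly quadratic once the error is small.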

3.2  A  Strong  Learning  Algorithm 

An idea that follows naturally is to treat the previously described procedure as a subroutine
for recursively boosting the accuracy of weaker hypotheses. The procedure is given a desired
error bound $\epsilon$ and a confidence parameter $\delta$, and constructs an
$\epsilon$-close hypothesis from weaker, recursively computed hypotheses. If
$\epsilon \ge 1/2 - 1/p(n,s)$ then an assumed weak learning algorithm can be used to find the
desired hypothesis; otherwise, an $\epsilon$-close hypothesis is computed recursively by
calling the subroutine with $\epsilon$ set to $g^{-1}(\epsilon)$.

Figure 1: A graph of the function $g(\alpha) = 3\alpha^2 - 2\alpha^3$.
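To make the recursion concrete, the sketch below inverts $g$ numerically and lists the error targets requested at successive levels until the weak learner's guarantee suffices. Bisection is merely one convenient way to compute $g^{-1}$ (it works because $g$ is increasing on $[0, 1/2]$); the function names and the example values $\epsilon = 0.01$ and weak error $0.45$ are assumptions for illustration.

```python
def g(a):
    return 3 * a * a - 2 * a**3

def g_inverse(eps, tol=1e-12):
    """Solve g(a) = eps for a in [0, 1/2] by bisection (g is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def error_schedule(eps, weak_error):
    """Error targets at successive recursion levels, ending once the target
    is at least weak_error = 1/2 - 1/p, where the weak learner can be used."""
    targets = [eps]
    while targets[-1] < weak_error:
        targets.append(g_inverse(targets[-1]))
    return targets

schedule = error_schedule(eps=0.01, weak_error=0.45)
for level, target in enumerate(schedule):
    print(level, round(target, 4))
```

Each level asks its sub-hypotheses for a weaker guarantee than its own, so the targets climb toward 1/2 until the weak learning algorithm can meet them.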

Unfortunately, this scheme by itself does not quite work due to a technical difficulty:
because of the way $\mathrm{EX}_2$ and $\mathrm{EX}_3$ are constructed, examples may be
required from a very small portion of the original distribution. If this happens, the time
spent waiting for an example to be chosen from this region may be great. Nevertheless, we
will see that this difficulty can be overcome by explicitly checking that the errors of
hypotheses $h_1$ and $h_2$ on $D$ are not too small.

Figure 2 shows a detailed sketch of the resulting strong learning algorithm Learn. The
procedure takes an error parameter $\epsilon$ and a confidence parameter $\delta$, and is also
provided with an examples oracle EX. The procedure is required to return an hypothesis whose
error is at most $\epsilon$ with probability at least $1 - \delta$. In the figure, $p$ is a
polynomial and WeakLearn($\delta$, EX) is an assumed weak learning procedure that outputs an
hypothesis $(1/2 - 1/p(n,s))$-close to the target concept $c$ with probability at least
$1 - \delta$. As above, $g(\alpha)$ is the function $3\alpha^2 - 2\alpha^3$, and the variable
$\alpha$ is set to the value $g^{-1}(\epsilon)$. Also, the quantities $\hat{a}_1$ and
$\hat{e}$ are estimates of the errors of $h_1$ and $h_2$ under the given distribution $D$.
These estimates are made with error tolerances $\tau_1$ and $\tau_2$ (defined in the figure),
and are computed in the obvious manner based on samples drawn from EX; the required size of
these samples can be determined, for instance, using Chernoff bounds [22]. The parameters $s$
and $n$ are assumed to be known globally.

Input:  error parameter $\epsilon$
        confidence parameter $\delta$
        examples oracle EX
        (implicit) size parameters $s$ and $n$
Return: hypothesis $h$ that is $\epsilon$-close to the target concept $c$ with probability $\ge 1 - \delta$
Procedure:
    if $\epsilon \ge 1/2 - 1/p(n,s)$ then return WeakLearn($\delta$, EX)
    $\alpha \leftarrow g^{-1}(\epsilon)$
    $\mathrm{EX}_1 \leftarrow \mathrm{EX}$
    $h_1 \leftarrow$ Learn($\alpha$, $\delta/5$, $\mathrm{EX}_1$)
    $\tau_1 \leftarrow \epsilon/3$
    let $\hat{a}_1$ be an estimate of $a_1 = \Pr_{v \in D}[h_1(v) \ne c(v)]$:
        choose a sample sufficiently large that $|a_1 - \hat{a}_1| \le \tau_1$ with probability $\ge 1 - \delta/5$
    if $\hat{a}_1 \le \epsilon - \tau_1$ then return $h_1$
    defun $\mathrm{EX}_2$()
        { flip coin
          if “heads,” return the first instance $v$ from EX for which $h_1(v) = c(v)$
          else return the first instance $v$ from EX for which $h_1(v) \ne c(v)$ }
    $h_2 \leftarrow$ Learn($\alpha$, $\delta/5$, $\mathrm{EX}_2$)
    $\tau_2 \leftarrow (1 - 2\alpha)\epsilon/8$
    let $\hat{e}$ be an estimate of $e = \Pr_{v \in D}[h_2(v) \ne c(v)]$:
        choose a sample sufficiently large that $|e - \hat{e}| \le \tau_2$ with probability $\ge 1 - \delta/5$
    if $\hat{e} \le \epsilon - \tau_2$ then return $h_2$
    defun $\mathrm{EX}_3$()
        { return the first instance $v$ from EX for which $h_1(v) \ne h_2(v)$ }
    $h_3 \leftarrow$ Learn($\alpha$, $\delta/5$, $\mathrm{EX}_3$)
    defun $h(v)$
        { $b_1 \leftarrow h_1(v)$, $b_2 \leftarrow h_2(v)$
          if $b_1 = b_2$ then return $b_1$
          else return $h_3(v)$ }
    return $h$

Figure 2: A strong learning algorithm Learn.
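The sample sizes for the two estimates can be worked out in several ways. The sketch below uses the two-sided Hoeffding inequality, which is one standard Chernoff-type bound and an assumption of this example rather than the specific bound of [22]. The names `sample_size` and `estimate_error`, and the toy oracle, are likewise hypothetical.

```python
import math
import random

def sample_size(tau, delta):
    """Smallest m with 2*exp(-2*m*tau**2) <= delta (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * tau**2))

def estimate_error(ex, h, tau, delta):
    """Empirical error of h on sample_size(tau, delta) examples drawn from ex.

    With probability at least 1 - delta the result is within tau of the
    true error of h under the distribution induced by ex.
    """
    m = sample_size(tau, delta)
    wrong = sum(h(v) != label for v, label in (ex() for _ in range(m)))
    return wrong / m

# Toy setting (hypothetical): instances 0..9 drawn uniformly, target c(v) = (v >= 5),
# and h1(v) = (v >= 3), whose true error under D is 0.2.
rng = random.Random(2)

def ex():
    v = rng.randrange(10)
    return v, v >= 5

def h1(v):
    return v >= 3

m = sample_size(0.05, 1e-6)
est = estimate_error(ex, h1, tau=0.05, delta=1e-6)
print(m, est)
```

Because the sample size grows as $1/\tau^2$ but only logarithmically in $1/\delta$, splitting the confidence parameter five ways in Learn costs comparatively little.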

Note  that  Learn  is  a  procedure  taking  as  one  of  its  inputs  a  function  (EX)  and  returning  as 
output  another  function  (h,  a  hypothesis,  which  is  treated  like  a  procedure).  Furthermore, 
to  simulate  new  example  oracles,  Learn  must  have  a  means  of  dynamically  defining  new 
procedures  (as  is  allowed,  for  instance,  by  most  Lisp-like  languages).  Therefore,  in  the 
figure,  we  have  used  the  somewhat  non-standard  keyword  defun  to  denote  the  definition  of 
a  new  function;  its  syntax  calls  for  a  name  for  the  procedure,  followed  by  a  parenthesized 
list  of  arguments,  and  the  body  indented  in  braces.  Static  scoping  is  assumed. 

Learn works by recursively boosting the accuracy of its hypotheses. Learn typically
calls itself three times using the three simulated example oracles described in the preceding
section. On each recursive call, the required error bound of the constructed hypotheses comes
closer to $1/2$; when this bound reaches $1/2 - 1/p(n,s)$, the weak learning algorithm
WeakLearn can be used.

The procedure takes measures to limit the run time of the simulated oracles it provides
on recursive calls. When Learn calls itself a second time to find $h_2$, the expected number
of iterations of $\mathrm{EX}_2$ to find an example depends on the error of $h_1$, which is
estimated by $\hat{a}_1$. If $h_1$ already has the desired accuracy $1 - \epsilon$, then there
is no need to find $h_2$ and $h_3$ since $h_1$ is a sufficiently good hypothesis; otherwise,
if $a_1 = \Omega(\epsilon)$, then it can be shown that $\mathrm{EX}_2$ will not loop too long
to find an instance. Similarly, when Learn calls itself to find $h_3$, the expected number of
iterations of $\mathrm{EX}_3$ depends on how often $h_1$ and $h_2$ disagree, which we will see
is in turn a function of the error of $h_2$ on the original distribution $D$. If this error
$e$ (which is estimated by $\hat{e}$) is small, then $h_2$ is a good hypothesis and is
returned by Learn. Otherwise, it will be shown that $\mathrm{EX}_3$ also will not run for too
long.

3.3  Correctness 

We  show  in  this  section  that  the  algorithm  is  correct  in  the  following  sense: 

Theorem 3.2 For $0 < \epsilon < 1/2$ and for $0 < \delta < 1$, the hypothesis returned by
calling Learn($\epsilon$, $\delta$, EX) is $\epsilon$-close to the target concept with
probability at least $1 - \delta$.




Figure 3: The distributions $D_1$ and $D_2$.


Proof: Proof is by induction starting at the bottom of the recursion. (Technically, the
induction is on $B(\epsilon, p(n,s))$, where $B$ is a function defined in the next section.)
The base case that $\epsilon \ge 1/2 - 1/p(n,s)$ follows trivially from our assumptions about
WeakLearn. Another easy case is that $\hat{a}_1$ or $\hat{e}$ is found to be smaller than
$\epsilon - \tau_1$ or $\epsilon - \tau_2$, respectively. In either case, it follows
immediately, due to the accuracy with which $a_1$ and $e$ have been estimated, that the
returned hypothesis is $\epsilon$-close to the target concept.

Otherwise, all three hypotheses must be found and combined. Let $a_i$ be the error of $h_i$
under $D_i$. Here, $D$ is the distribution of the provided oracle EX, and $D_i$ is the
distribution induced by oracle $\mathrm{EX}_i$ on the $i$th recursive call ($i = 1, 2, 3$).
By inductive hypothesis, $a_i \le \alpha$ with probability $1 - \delta/5$.

In the special case that all hypotheses are deterministic, the distributions $D_1$ and $D_2$
can be depicted schematically as shown in Figure 3. The figure shows the portion of each
distribution for which the hypotheses $h_1$ and $h_2$ agree with the target concept $c$. For
each distribution, the top crosshatched bar represents the relative fraction of the instance
space for which $h_1$ agrees with $c$; the bottom striped bar represents those instances for
which $h_2$ agrees with $c$. Although only valid for deterministic hypotheses, this figure may
be helpful for motivating one’s intuition in what follows.


Let $p_i(v) = \Pr[h_i(v) \ne c(v)]$ be the chance that some fixed instance $v$ is
misclassified by $h_i$. (Recall that hypotheses may be randomized, and therefore it is
necessary to consider the probability that a particular fixed instance is misclassified.)
Similarly, let $q(v) = \Pr[h_1(v) \ne h_2(v)]$ be the chance that $v$ is classified
differently by $h_1$ and $h_2$. Also define $w$, $x$, $y$, and $z$ as follows:

$$\begin{aligned}
w &= \Pr_{v \in D}[h_2(v) \ne h_1(v) = c(v)] \\
x &= \Pr_{v \in D}[h_1(v) = h_2(v) = c(v)] \\
y &= \Pr_{v \in D}[h_1(v) \ne h_2(v) = c(v)] \\
z &= \Pr_{v \in D}[h_1(v) = h_2(v) \ne c(v)]
\end{aligned}$$

Clearly,

$$w + x = \Pr_{v \in D}[h_1(v) = c(v)] = 1 - a_1, \tag{1}$$

and since $c$, $h_1$ and $h_2$ are Boolean,

$$y + z = \Pr_{v \in D}[h_1(v) \ne c(v)] = a_1. \tag{2}$$

In terms of these variables, we can express explicitly the chance that $\mathrm{EX}_i$
returns instance $v$:

$$D_1(v) = D(v) \tag{3}$$

$$D_2(v) = \frac{D(v)}{2} \left( \frac{p_1(v)}{a_1} + \frac{1 - p_1(v)}{1 - a_1} \right) \tag{4}$$

$$D_3(v) = \frac{D(v)\, q(v)}{w + y} \tag{5}$$

Equation (3) is trivial. To see that (4) holds, note that the chance that the initial coin
flip comes up “tails” is $1/2$, and the chance that instance $v$ is the first instance
misclassified by $h_1$ is $D(v) p_1(v) / a_1$. The case that the coin comes up “heads” is
handled in a similar fashion, as is the derivation of equation (5).
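Equation (4) can be checked numerically on a small example. The sketch below builds $D_2$ from a hypothetical distribution `D` and a deterministic $h_1$ (so each $p_1(v)$ is 0 or 1), then confirms that $D_2$ sums to one and that $h_1$ has error exactly $1/2$ under it. All concrete values are assumptions made for the illustration.

```python
from fractions import Fraction as F

def filtered_d2(D, p1):
    """Distribution D2 of equation (4), built from D and p1(v) = Pr[h1(v) != c(v)]."""
    a1 = sum(D[v] * p1[v] for v in D)            # error of h1 under D
    return {v: D[v] / 2 * (p1[v] / a1 + (1 - p1[v]) / (1 - a1)) for v in D}

# Hypothetical three-instance domain; h1 is deterministic and errs only on instance 2.
D = {0: F(1, 2), 1: F(3, 10), 2: F(1, 5)}
p1 = {0: F(0), 1: F(0), 2: F(1)}

D2 = filtered_d2(D, p1)
total = sum(D2.values())
err_h1_under_d2 = sum(D2[v] * p1[v] for v in D2)
print(D2)
print(total, err_h1_under_d2)
```

The region on which `h1` errs carries weight 1/5 under `D` but weight 1/2 under `D2`, which is how the filter removes the advantage `h1` had. For a randomized $h_1$ the error under $D_2$ need not be exactly $1/2$, consistent with the "roughly equal chance" wording used earlier.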

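From equation (4), we have that

$$\begin{aligned}
1 - a_2 &= \sum_{v \in X_n} D_2(v) (1 - p_2(v)) \\
&= \frac{1}{2 a_1} \sum_{v \in X_n} D(v) p_1(v) (1 - p_2(v)) + \frac{1}{2 (1 - a_1)} \sum_{v \in X_n} D(v) (1 - p_1(v)) (1 - p_2(v)) \\
&= \frac{y}{2 a_1} + \frac{x}{2 (1 - a_1)}.
\end{aligned} \tag{6}$$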


Combining equations (1), (2) and (6), we see that the values of $w$, $x$ and $z$ can be
written explicitly in terms of $y$, $a_1$ and $a_2$. (Note that (6) could also have been
derived from Figure 3 in the case of deterministic hypotheses: if $\beta$ is as shown in the
figure, then it is not hard to see that $y = 2 a_1 \beta$ and
$x = 2 (1 - a_1)(1 - a_2 - \beta)$. These imply (6).)

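Finally, using equation (5), we are ready to compute the error of the output hypothesis
$h$:

$$\begin{aligned}
\Pr_{v \in D}[h(v) \ne c(v)] &= \Pr_{v \in D}[(h_1(v) = h_2(v) \ne c(v)) \lor (h_1(v) \ne h_2(v) \land h_3(v) \ne c(v))] \\
&= z + \sum_{v \in X_n} D(v)\, q(v)\, p_3(v) \\
&= z + \sum_{v \in X_n} (w + y)\, D_3(v)\, p_3(v) \\
&= z + a_3 (w + y) \\
&= a_1 - a_3 + a_1 a_3 + 2 a_2 a_3 - 2 a_1 a_2 a_3 + \frac{y}{a_1} (a_3 - a_1) \\
&\le 3\alpha^2 - 2\alpha^3 = g(\alpha) = \epsilon
\end{aligned}$$

as desired. The inequality here follows with some care from the facts that each
$a_i \le \alpha$, and that $y \le a_1$.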
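The algebra in the last two steps can be spot-checked with exact arithmetic. The sketch below recovers $x$, $w$ and $z$ from equations (6), (1) and (2), compares $z + a_3(w + y)$ with the closed form, and searches a coarse grid for the largest error subject to $a_i \le \alpha$ and $0 \le y \le a_1$. The grid, the choice $\alpha = 2/5$, and the function names are assumptions of this illustration; a grid search is a sanity check and does not replace the proof.

```python
from fractions import Fraction as F
from itertools import product

def g(a):
    return 3 * a**2 - 2 * a**3

def combined_error(a1, a2, a3, y):
    """Error z + a3*(w + y) of the majority hypothesis, or None if the inputs
    do not correspond to valid probabilities w, x, z >= 0."""
    x = 2 * (1 - a1) * (1 - a2) - (1 - a1) * y / a1   # solved from equation (6)
    w = 1 - a1 - x                                     # from equation (1)
    z = a1 - y                                         # from equation (2)
    if min(w, x, z) < 0:
        return None
    return z + a3 * (w + y)

def closed_form(a1, a2, a3, y):
    return a1 - a3 + a1 * a3 + 2 * a2 * a3 - 2 * a1 * a2 * a3 + (y / a1) * (a3 - a1)

alpha = F(2, 5)
grid = [F(k, 20) for k in range(1, 9)]        # error values in (0, alpha]
worst = F(0)
for a1, a2, a3 in product(grid, repeat=3):
    for j in range(5):
        y = a1 * F(j, 4)                       # 0 <= y <= a1
        err = combined_error(a1, a2, a3, y)
        if err is None:
            continue
        assert err == closed_form(a1, a2, a3, y)
        worst = max(worst, err)
print(worst, g(alpha))
```

On this grid the worst case is attained at $a_1 = a_2 = a_3 = \alpha$ and $y = a_1$, where the error equals $g(\alpha)$.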

Finally, note that the confidence parameter $\delta$ has been “spread around” so that the
overall chance of anything “going wrong” is at most $\delta$. ■

3.4  Analysis 

In this section, we argue that Learn runs in polynomial time. Here and throughout this
section, unless stated otherwise, polynomial refers to polynomial in $n$, $s$, $1/\epsilon$
and $1/\delta$. Our approach will be to first derive a bound on the expected running time of
the procedure, and to then use a part of the confidence $\delta$ to bound with high
probability the actual running time of the algorithm. Thus, we will have shown that the
procedure is probably fast and correct, completing the proof of Theorem 3.1. (Although
technically we only show that Learn halts probabilistically, by the results of Haussler et
al. [10], the procedure can easily be converted into a learning algorithm that halts
deterministically in polynomial time.)

We will be interested in bounding several quantities. First, we are of course interested
in bounding the expected running time $T(\epsilon, \delta)$ of Learn($\epsilon$, $\delta$,
EX). This running time in turn depends on the time $U(\epsilon, \delta)$ to evaluate an
hypothesis returned by Learn, and on the expected number of examples $M(\epsilon, \delta)$
needed by Learn. In addition, let $t(\delta)$, $u(\delta)$ and $m(\delta)$ be analogous
quantities for WeakLearn($\delta$, EX). By assumption, $t$, $u$ and $m$ are polynomially
bounded. Also, all of these functions depend implicitly on $n$ and $s$.

As  a  technical  point,  we  note  that  the  expectations  denoted  by  T  and  M  are  taken  only 
over  “good”  runs  of  Learn.  That  is,  the  expectations  are  computed  given  the  assumption 
that  every  sub-hypothesis  and  every  estimator  is  successfully  computed  with  the  desired 
accuracy. By Theorem 3.2, this will be the case with probability $1 - \delta$.

It  is  also  important  to  point  out  that  T  (respectively,  t)  is  the  expected  running  time 
of Learn (WeakLearn) when called with an oracle EX that provides examples in unit time.
Our  analysis  will  take  into  account  the  fact  that  the  simulated  oracles  supplied  to  Learn  or 
WeakLearn  at  lower  levels  of  the  recursion  do  not  in  general  run  in  unit  time. 

We will see that $T$, $U$ and $M$ are all exponential in the depth of the recursion induced
by calling Learn. We therefore begin by bounding this depth. Let $B(\epsilon,p)$ be the smallest
integer $i$ for which $g^i\left(\frac12 - \frac1p\right) \le \epsilon$. On each recursive call, $\epsilon$ is replaced by $g^{-1}(\epsilon)$. Thus the
depth of the recursion is bounded by $B(\epsilon, p(n,s))$. We have:

Lemma 3.1 The depth of the recursion induced by calling Learn($\epsilon$, $\delta$, EX) is at most
$B(\epsilon, p(n,s)) = O(\log(p(n,s)) + \log\log(1/\epsilon))$.

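Proof: We can say $B(\epsilon, p(n,s)) \le b + c$ if $g^b\left(\frac12 - \frac1p\right) \le \frac14$ and $g^c\left(\frac14\right) \le \epsilon$. Clearly, $g(x) \le 3x^2$
and so $g^i(x) \le (3x)^{2^i}$. Thus, it suffices to choose $c = \lceil \lg \log_{4/3}(1/\epsilon) \rceil$. Similarly, if $\frac14 \le x \le \frac12$
then $\frac12 - g(x) = \left(\frac12 - x\right)(1 + 2x - 2x^2) \ge \frac{11}{8}\left(\frac12 - x\right)$. Thus, the proof is completed by
choosing $b = \lceil \log_{11/8}\left(\frac14 p(n,s)\right) \rceil$. ■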
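The depth $B(\epsilon, p)$ can also be computed directly by iterating $g$. The sketch below is illustrative only: the function names `recursion_depth` and `depth_bound` are ours, and `depth_bound` simply evaluates the $b + c$ from the proof above, assuming $0 < \epsilon < \frac12$ and $p > 2$.

```python
import math


def g(x):
    """The error map g(x) = 3x^2 - 2x^3 used throughout the construction."""
    return 3 * x**2 - 2 * x**3


def recursion_depth(eps, p):
    """Smallest i with g^i(1/2 - 1/p) <= eps, i.e. B(eps, p)."""
    x, i = 0.5 - 1.0 / p, 0
    while x > eps:
        x, i = g(x), i + 1
    return i


def depth_bound(eps, p):
    """The bound b + c from the proof of Lemma 3.1 (0 < eps < 1/2, p > 2)."""
    b = max(0, math.ceil(math.log(p / 4, 11 / 8)))
    c = max(0, math.ceil(math.log2(math.log(1 / eps, 4 / 3))))
    return b + c
```

For instance, `recursion_depth(0.01, 4)` iterates $g$ from $\frac14$ and stops after four applications, within the bound returned by `depth_bound(0.01, 4)`.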

For the remainder of this analysis, we let $p = p(n,s)$ and, where clear from context, let
$B = B(\epsilon,p)$.

We  show  next  that  U  is  polynomially  bounded.  This  is  important  because  we  require 
that  the  returned  hypothesis  be  polynomially  evaluatable. 

Lemma 3.2 The time to evaluate an hypothesis returned by Learn($\epsilon$, $\delta$, EX) is $U(\epsilon,\delta) = O(3^B \cdot u(\delta/5^B))$.

Proof: If $\epsilon \ge \frac12 - \frac1p$, then Learn returns an hypothesis computed by WeakLearn. In this
case, $U(\epsilon,\delta) = u(\delta)$. Otherwise, the hypothesis returned by Learn involves the computation
of at most three sub-hypotheses. Thus,

$$U(\epsilon,\delta) \le 3 \cdot U(g^{-1}(\epsilon), \tfrac15\delta) + O(1).$$

Solving this easy recurrence, we arrive at the stated time bound. ■

When  an  example  is  requested  of  a  simulated  oracle  on  one  of  Learn’s  recursive  calls, 
that  oracle  must  itself  draw  several  examples  from  its  own  oracle  EX.  For  instance,  on  the 
third recursive call, the simulated oracle must draw instances until it finds one on which $h_1$
and $h_2$ disagree. Naturally, the running time of Learn depends on how many examples must
be  drawn  in  this  manner  by  the  simulated  oracle.  The  next  lemma  bounds  this  quantity. 

Lemma 3.3 Let $r$ be the expected number of examples drawn from EX by any oracle EX$_i$
simulated by Learn when asked to provide a single example. Then $r \le 4/\epsilon$.


Proof: When Learn calls itself the first time (to find $h_1$), the examples oracle EX it was
passed is left unchanged. In this case, $r = 1$.

The second time Learn calls itself, the constructed oracle EX$_2$ loops each time it is called
until it receives a desirable example. Depending on the result of the initial coin flip, we
expect EX$_2$ to loop $1/a_1$ or $1/(1 - a_1)$ times. Note that if $a_1 < \epsilon - 2\tau_1 = \frac13\epsilon$ then, based on
its estimate of $a_1$, Learn would have simply returned $h_1$ instead of making a second or third
recursive call. Thus, we can assume $\frac13\epsilon \le a_1 \le \frac12$, and so $r \le 3/\epsilon$ in this case.
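To make these expected loop counts concrete, the following sketch simulates an oracle of the kind EX$_2$ is described as being: it flips a fair coin and then draws from EX until it sees an example on which $h_1$ is correct (heads) or incorrect (tails). The toy distribution, the toy hypothesis, and all function names are assumptions made purely for illustration.

```python
import random


def make_ex2(ex, h1, rng):
    """Wrap oracle `ex` (returning (v, label)) as a filtering oracle.

    Returns the new oracle and a one-element list counting draws from `ex`.
    """
    draws = [0]

    def ex2():
        want_correct = rng.random() < 0.5  # the initial fair coin flip
        while True:
            v, label = ex()
            draws[0] += 1
            if (h1(v) == label) == want_correct:
                return v, label

    return ex2, draws


def average_draws(a1, calls, seed=0):
    """Average EX draws per EX2 call, and h1's error rate under EX2."""
    rng = random.Random(seed)
    concept = lambda v: v < 0.5  # toy target concept on [0, 1)
    # Toy h1: wrong exactly on [1 - a1, 1), so its error under EX is a1.
    h1 = lambda v: (not concept(v)) if v >= 1 - a1 else concept(v)

    def ex():
        v = rng.random()  # uniform distribution on [0, 1)
        return v, concept(v)

    ex2, draws = make_ex2(ex, h1, rng)
    wrong = sum(h1(v) != label for v, label in (ex2() for _ in range(calls)))
    return draws[0] / calls, wrong / calls
```

With $a_1 = 0.1$ one would expect about $\frac12 \cdot \frac{1}{0.1} + \frac12 \cdot \frac{1}{0.9} \approx 5.6$ draws per call, and $h_1$ to be wrong on roughly half of the examples EX$_2$ returns.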

Finally, when Learn calls itself the third time, we expect the constructed oracle EX$_3$ to
loop $1/(w + y)$ times before finding a suitable example. (Here, the variables $w$, $x$, $y$ and $z$
are as defined in the proof of Theorem 3.2.) It remains then only to show that $w + y \ge \frac14\epsilon$.
Note that the error $e$ of $h_2$ on the original distribution $D$ is $w + z$. Thus, using this fact and
equations (1), (2) and (6), we can solve explicitly for $w$ and $y$ in terms of $e$, $a_1$ and $a_2$, and
so find that

$$w + y = a_1 + \frac{e - 4a_1 a_2 (1 - a_1)}{1 - 2a_1} \ge a_1 + \frac{e - 4a_1 \alpha (1 - a_1)}{1 - 2a_1}. \qquad (7)$$

To lower bound $w + y$, we will find the minimum of this second function on the interval
$[0, \alpha]$. Differentiating (with respect to $a_1$) we see that the function has at most one critical
point less than $\frac12$, and we note further that such a critical point cannot be minimal since the
function tends to $-\infty$ as $a_1 \to -\infty$. This means that the function’s minimum on any closed
subinterval of $(-\infty, \frac12)$ is achieved at one endpoint of the subinterval. In particular, for the
subinterval of interest to us, the function achieves its minimum either when $a_1 = 0$ or when
$a_1 = \alpha$.

We can assume that $e \ge \epsilon - 2\tau_2 = \left(\frac34 + \frac12\alpha\right)\epsilon$; otherwise, if $e$ were smaller than this
quantity, then Learn would have returned $h_2$ rather than going on to compute $h_3$. Thus, if
$a_1 = 0$ then $w + y \ge e \ge \frac34\epsilon$, and if $a_1 = \alpha$ then $w + y \ge \frac14\alpha(4 - 7\alpha + 2\alpha^2) \ge \frac14\alpha \ge \frac14\epsilon$. ■
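As a numerical sanity check on the endpoint argument, the sketch below evaluates the lower bound of equation (7), read here as $a_1 + (e - 4a_1\alpha(1 - a_1))/(1 - 2a_1)$, at the smallest admissible $e = (\frac34 + \frac12\alpha)\epsilon$ over a grid of $a_1 \in [0, \alpha]$. The reading of (7), the grid size and the function names are assumptions of this sketch.

```python
def g(x):
    """The error map g(x) = 3x^2 - 2x^3."""
    return 3 * x**2 - 2 * x**3


def rhs7(a1, e, alpha):
    """Lower bound on w + y from equation (7), with a2 replaced by alpha."""
    return a1 + (e - 4 * a1 * alpha * (1 - a1)) / (1 - 2 * a1)


def min_w_plus_y(alpha, steps=1000):
    """Grid minimum of rhs7 over a1 in [0, alpha] at e = (3/4 + alpha/2) * eps."""
    eps = g(alpha)
    e = (0.75 + 0.5 * alpha) * eps
    return min(rhs7(alpha * k / steps, e, alpha) for k in range(steps + 1))


def check_lemma_3_3(alphas):
    """True if the grid minimum is at least eps/4 for every alpha given."""
    return all(min_w_plus_y(a) >= g(a) / 4 - 1e-12 for a in alphas)
```

Calling `check_lemma_3_3` on values of $\alpha$ strictly below $\frac12$ avoids the singularity of (7) at $a_1 = \frac12$.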

Lemma 3.4 The expected number of examples $M(\epsilon,\delta)$ needed by Learn($\epsilon$, $\delta$, EX) is

$$O\left(\frac{36^B}{\epsilon^2} \cdot \left(B p^2 + p^2 \log(1/\delta) + m(\delta/5^B)\right)\right).$$


Proof: In the base case that $\epsilon \ge \frac12 - \frac1p$, Learn simply calls WeakLearn, so we have $M(\epsilon,\delta) = m(\delta)$.
On each of the recursive calls, the simulated oracle is required to provide $M(g^{-1}(\epsilon), \frac15\delta)$
examples. To provide one such example, the simulated oracle must itself draw at most
an average of $4/\epsilon$ examples from EX. Thus, each recursive call demands at most
$(4/\epsilon) \cdot M(g^{-1}(\epsilon), \frac15\delta)$ examples on average.

In addition, Learn requires some examples for making its estimates $\hat{a}_1$ and $\hat{e}$. Using
Chernoff bounds [22], we can show that a sample of size $O(p^2 \log(1/\delta)/\epsilon^2)$ suffices.

We thus arrive at the recurrence:

$$M(\epsilon,\delta) \le \frac{12}{\epsilon} \cdot M(g^{-1}(\epsilon), \tfrac15\delta) + O\left(\frac{p^2 \log(1/\delta)}{\epsilon^2}\right).$$

Making use of the fact that $g^{-1}(\epsilon) \ge \sqrt{\epsilon/3}$ and that $B(g^{-1}(\epsilon), p) = B(\epsilon,p) - 1$, we can solve
this recurrence and arrive at the stated bound. ■
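One way to see where a factor of the form $36^B/\epsilon^2$ can come from is to unroll the multiplicative part of the recurrence, assuming three recursive calls that each cost a factor of $4/\epsilon$. The following is only a sketch; the shorthand $\epsilon_0 = \epsilon$ and $\epsilon_{i+1} = g^{-1}(\epsilon_i)$ for the error parameter at recursion depth $i$ is introduced here and is not the paper's notation.

```latex
\frac{1}{\epsilon_{i+1}} \le \sqrt{\frac{3}{\epsilon_i}}
\quad\Longrightarrow\quad
\frac{1}{\epsilon_i} \le 3^{\,1 - 2^{-i}} \left(\frac{1}{\epsilon}\right)^{2^{-i}}
\le 3 \left(\frac{1}{\epsilon}\right)^{2^{-i}},
\qquad
\prod_{i=0}^{B-1} \frac{12}{\epsilon_i}
\le 12^B \cdot 3^B \cdot \left(\frac{1}{\epsilon}\right)^{\sum_{i \ge 0} 2^{-i}}
= \frac{36^B}{\epsilon^2}.
```

The first inequality is the fact $g^{-1}(\epsilon) \ge \sqrt{\epsilon/3}$ applied at depth $i$; the geometric sum in the exponent is what keeps the dependence on $1/\epsilon$ quadratic rather than growing with $B$.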


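Lemma 3.5 The expected execution time of Learn($\epsilon$, $\delta$, EX) is given by

$$T(\epsilon,\delta) = O\left(3^B \cdot t(\delta/5^B) + \frac{108^B \cdot u(\delta/5^B)}{\epsilon^2} \cdot \left(B p^2 + p^2\log(1/\delta) + m(\delta/5^B)\right)\right).$$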

Proof: As in the previous lemmas, the base case that $\epsilon \ge \frac12 - \frac1p$ is easily handled. In this
case, $T(\epsilon,\delta) = t(\delta)$.

Otherwise, Learn takes time $3 \cdot T(g^{-1}(\epsilon), \frac15\delta)$ on its three recursive calls. In addition,
Learn spends time drawing examples to make the estimates $\hat{a}_1$ and $\hat{e}$, and overhead time is
also spent by the simulated examples oracles passed on the three recursive calls. A typical
example that is drawn from Learn’s oracle EX is evaluated on zero, one or two of the
previously computed sub-hypotheses. For instance, an example drawn for the purpose of
estimating $a_1$ is evaluated once by $h_1$; an example drawn for the simulated oracle EX$_3$ is
evaluated by both $h_1$ and $h_2$. Thus, Learn’s overhead time is proportional to the product
of the total number of examples needed by Learn and the time it takes to evaluate a
sub-hypothesis on one of these examples. Therefore, the following recurrence holds:

$$T(\epsilon,\delta) \le 3 \cdot T(g^{-1}(\epsilon), \tfrac15\delta) + O\left(U(g^{-1}(\epsilon), \tfrac15\delta) \cdot M(\epsilon,\delta)\right). \qquad (8)$$

Applying Lemmas 3.2 and 3.4, this recurrence implies the stated bound. ■

The  main  result  of  this  section  follows  immediately: 

Theorem 3.3 Let $0 < \epsilon < \frac12$ and let $0 < \delta \le 1$. With probability at least $1 - \delta$, the execution
of Learn($\epsilon$, $\frac12\delta$, EX) halts in polynomial time and outputs an hypothesis $\epsilon$-close to the target
concept.

Proof: The chance that Learn does not output an hypothesis $\epsilon$-close to $c$ is at most $\frac12\delta$.
By Markov’s inequality, the chance that such an hypothesis is output after time $(2/\delta) \cdot T(\epsilon, \frac12\delta)$ is also at most $\frac12\delta$. ■

3.5  Space  Complexity 

Although  not  of  immediate  consequence  to  the  proof  of  Theorem  3.3,  it  is  worth  pointing 
out  that  Learn’s  space  requirements  are  relatively  modest,  as  proved  in  this  section. 

Let $S(\epsilon,\delta)$ be the space used by Learn($\epsilon$, $\delta$, EX); let $Q(\epsilon,\delta)$ be the space needed to store
an output hypothesis; and let $R(\epsilon,\delta)$ be the space needed to evaluate such an hypothesis.
Let $s(\delta)$, $q(\delta)$ and $r(\delta)$ be analogous quantities for WeakLearn($\delta$, EX). Then we have:

Lemma 3.6 The space $Q(\epsilon,\delta)$ required to store an hypothesis output by Learn($\epsilon$, $\delta$, EX)
is at most $O(3^B \cdot q(\delta/5^B))$. The space $R(\epsilon,\delta)$ needed to evaluate such an hypothesis is
$O(B + r(\delta/5^B))$. Finally, the total space $S(\epsilon,\delta)$ required by Learn is
$O(3^B \cdot q(\delta/5^B) + s(\delta/5^B) + B \cdot r(\delta/5^B))$.

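Proof: For $\epsilon \ge \frac12 - \frac1p$, the bounds are trivial. Otherwise, the following recurrences are easy
to derive bounding $Q$ and $R$:

$$Q(\epsilon,\delta) \le 3 \cdot Q(g^{-1}(\epsilon), \tfrac15\delta) + O(1)$$

$$R(\epsilon,\delta) \le R(g^{-1}(\epsilon), \tfrac15\delta) + O(1)$$

For bounding $S$, note that the space required by Learn is dominated by the storage of
the sub-hypotheses, by their recursive computation, and by the space needed to evaluate
them. Thus,

$$S(\epsilon,\delta) \le S(g^{-1}(\epsilon), \tfrac15\delta) + O\left(Q(g^{-1}(\epsilon), \tfrac15\delta) + R(g^{-1}(\epsilon), \tfrac15\delta)\right)$$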

which implies the desired bound. ■

4 Improving Learn’s Time and Sample Complexity

In this section, we describe a modification to the construction of Section 3 that significantly
improves Learn’s time and sample complexity. In particular, we will improve these complexity
measures by roughly a factor of $1/\epsilon$, giving bounds that are linear in $1/\epsilon$ (ignoring log
factors).  These  improved  bounds  will  have  some  interesting  consequences,  described  in  later 
sections. 

In the original construction of Learn, much time and many examples are squandered by
the simulated oracles EX$_i$ waiting for a desirable instance to be drawn. Lemma 3.3 showed
that the expected time spent waiting is $O(1/\epsilon)$. The modification described below will reduce
this to $O(1/\alpha) = O(1/\sqrt{\epsilon})$. (Here, $\alpha = g^{-1}(\epsilon)$ as before.)

Recall that the running time of oracle EX$_2$ depends on the error $a_1$ of the first
sub-hypothesis $h_1$. In the original construction, we ensured that $a_1$ not be too small by estimating
its value, and, if smaller than $\epsilon$, returning $h_1$ instead of continuing the normal execution of
the subroutine. Since this approach only guarantees that $a_1 \ge \Omega(\epsilon)$, there does not seem
to be any way of ensuring that EX$_2$ run for $o(1/\epsilon)$ time. To improve EX$_2$’s running time
then, we will instead modify $h_1$ by deliberately increasing its error. Ironically, this intentional
injection of error will have the effect of improving Learn’s worst case running time by limiting
the time spent by either EX$_2$ or EX$_3$ waiting for a suitable instance.


4.1  The  Modifications 


Specifically, here is how Learn is modified. Call the new procedure Learn'. Following the
recursive computation of $h_1$, Learn' estimates the error $a_1$ of $h_1$, although less accurately
than Learn. Let $\hat{a}_1$ be this estimate, and choose a sample large enough that $|a_1 - \hat{a}_1| \le \frac14\alpha$
with probability at least $1 - \frac15\delta$. Since $0 \le a_1 \le \alpha$, we can assume without loss of generality
that $\frac14\alpha \le \hat{a}_1 \le \frac34\alpha$.

Next, Learn' defines a new hypothesis $h_1'$ as follows: given an instance $v$, $h_1'$ first flips a
coin biased to turn up “heads” with probability exactly

$$p = \frac{\frac34\alpha - \hat{a}_1}{1 - \frac14\alpha - \hat{a}_1}.$$

If the outcome is “tails,” then $h_1'$ evaluates $h_1(v)$ and returns the result. Otherwise, if
“heads,” $h_1'$ predicts the wrong answer, $\neg c(v)$. Since $h_1'$ will only be used during the training
phase, we can assume that the correct classification of $v$ is available, and thus that $h_1'$ can
be simulated.

This new hypothesis $h_1'$ is now used in place of $h_1$ by EX$_2$ and EX$_3$. The rest of the
subroutine is unmodified. In particular, the final returned hypothesis $h$ is unchanged:
that is, $h_1$, not $h_1'$, is used by $h$.

4.2  Correctness 

To see that Learn' is correct, note first that the error of $h_1'$ is exactly $a_1' = (1 - p)a_1 + p$
since the chance of error is $a_1$ on “tails,” and is 1 on “heads.” By our choice of $p$, it can be
verified that $\frac12\alpha \le a_1' \le \alpha$.
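The claimed range for $a_1'$ can be checked numerically. The sketch below assumes one particular reading of the construction: the estimate satisfies $|a_1 - \hat{a}_1| \le \frac14\alpha$ and lies in $[\frac14\alpha, \frac34\alpha]$, and $p = (\frac34\alpha - \hat{a}_1)/(1 - \frac14\alpha - \hat{a}_1)$. These constants are a reconstruction and should be treated as assumptions; the function names are ours.

```python
def heads_probability(a1_hat, alpha):
    """Bias p of the coin flipped by the modified hypothesis h1' (assumed form)."""
    return (0.75 * alpha - a1_hat) / (1 - 0.25 * alpha - a1_hat)


def inflated_error(a1, a1_hat, alpha):
    """Error a1' = (1 - p) * a1 + p of h1' when h1 has true error a1."""
    p = heads_probability(a1_hat, alpha)
    return (1 - p) * a1 + p


def error_range_holds(alpha, steps=50):
    """Check alpha/2 <= a1' <= alpha over a grid of consistent (a1, a1_hat)."""
    tol = 1e-12
    for i in range(steps + 1):
        a1_hat = alpha * (0.25 + 0.5 * i / steps)  # in [alpha/4, 3*alpha/4]
        lo = max(0.0, a1_hat - 0.25 * alpha)        # a1 within alpha/4 of a1_hat
        hi = min(alpha, a1_hat + 0.25 * alpha)
        for j in range(steps + 1):
            a1 = lo + (hi - lo) * j / steps
            a1_prime = inflated_error(a1, a1_hat, alpha)
            if not (alpha / 2 - tol <= a1_prime <= alpha + tol):
                return False
    return True
```

Under these assumptions both ends of the range are attained: $a_1' = \frac12\alpha$ when $a_1 = \hat{a}_1 - \frac14\alpha$ would require $p(1 - a_1)$ to equal $\frac34\alpha - \hat{a}_1$ exactly, and $a_1' = \alpha$ when $a_1 = \hat{a}_1 + \frac14\alpha$.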

Let $h'$ be the same hypothesis as $h$, except with $h_1'$ used in lieu of $h_1$. Note that $h'$, $h_1'$, $h_2$
and $h_3$ are related to one another in exactly the same way that $h$, $h_1$, $h_2$ and $h_3$ are related
in the original proof of Theorem 3.2. That is, if we imagine that $h_1'$ is returned on the first
recursive call of the original procedure Learn, then it is not impossible that $h_2$ and $h_3$ would
be returned on the second and third recursive calls, in which case $h'$ would be the returned
hypothesis. Therefore, by the same argument used to prove Theorem 3.2, the error of $h'$ is
at most $g(\alpha) = \epsilon$.

It is not surprising, and is not hard to verify, that increasing $h_1$’s chance of error on
instance $v$ cannot possibly decrease the chance that $h$ misclassifies $v$. Therefore, since
$\Pr[h_1'(v) \ne c(v)] \ge \Pr[h_1(v) \ne c(v)]$ for any $v$, the error of $h$ cannot be greater than the error
of $h'$, which is bounded by $\epsilon$.
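The monotonicity claim can be verified in one line. The sketch below assumes, as in the construction of Section 3, that $h$ outputs the majority vote of $h_1$, $h_2$ and $h_3$, and that on a fixed instance $v$ the three sub-hypotheses err independently with probabilities $\beta_1$, $\beta_2$ and $\beta_3$ (notation introduced here for illustration):

```latex
\Pr[h(v) \neq c(v)]
  = \beta_1\beta_2 + \beta_1\beta_3 + \beta_2\beta_3 - 2\beta_1\beta_2\beta_3,
\qquad
\frac{\partial}{\partial \beta_1} \Pr[h(v) \neq c(v)]
  = \beta_2 + \beta_3 - 2\beta_2\beta_3
  = \beta_2(1 - \beta_3) + \beta_3(1 - \beta_2) \ge 0.
```

Since the derivative is nonnegative for all $\beta_2, \beta_3 \in [0, 1]$, raising $\beta_1$ can never lower the chance that the majority vote is wrong on $v$.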

4.3  Analysis 

Next,  we  show  that  Learn'  runs  faster  using  fewer  examples  than  Learn.  We  use  essentially 
the  same  analysis  as  in  Section  3.4.  The  following  three  lemmas  are  modified  versions  of 
Lemmas  3.3,  3.4  and  3.5.  The  proofs  of  the  other  lemmas  apply  immediately  to  Learn'  with 
little  or  no  modification,  and  so  are  omitted. 

Lemma 4.1 Let $r$ be the expected number of examples drawn from EX by any oracle EX$_i$
simulated by Learn' when asked to provide a single example. Then $r \le 4/\alpha$.

Proof: As in the original proof, $r = 1$ for EX$_1$. We expect the second oracle to loop at most
$1/a_1'$ times on average. Since $a_1' \ge \frac12\alpha$, $r$ is at most $2/\alpha$ in this case.

Finally, to bound the number of iterations of EX$_3$, we will show that $w + y \ge \frac14\alpha$ using
equation (7) as in the original proof. To lower bound $w + y$, we find the minimum of the last
formula of (7) (with $a_1$ replaced by $a_1'$ of course) on the interval $[\frac12\alpha, \alpha]$. As noted previously,
the function must achieve its minimum at one endpoint of the interval. Assuming as in the
original proof that $e \ge \epsilon - 2\tau_2$, we see that when $a_1' = \alpha$, $w + y \ge \frac14\alpha(4 - 7\alpha + 2\alpha^2) \ge \frac14\alpha$;
similarly, when $a_1' = \frac12\alpha$, $w + y \ge \frac12\alpha + \alpha^3 + \frac14\alpha^2/(1 - \alpha) \ge \frac14\alpha$. This completes the proof. ■

Lemma 4.2 The expected number of examples $M(\epsilon,\delta)$ needed by Learn'($\epsilon$, $\delta$, EX) is

$$O\left(\frac{36^B}{\epsilon} \cdot \left(B p^2 + p^2\log(1/\delta) + m(\delta/5^B)\right)\right).$$

Proof: The proof is nearly the same as for Lemma 3.4. In addition to incorporating the
superior bound given by Lemma 4.1 on the number of examples needed by the simulated
oracles, we must also consider the number of examples needed to estimate $a_1$ and $e$. The first,
$a_1$, can be estimated using a sample of size $O(\log(1/\delta)/\alpha^2) = O(\log(1/\delta)/\epsilon)$. By estimating $e$
in a slightly different manner, we can also achieve a better bound on the sample size needed.
Specifically, we can choose a sample large enough that, with probability $1 - \frac15\delta$, $\hat{e} \le \epsilon - \tau_2$
if $e \le \epsilon - 2\tau_2$, and $\hat{e} > \epsilon - \tau_2$ if $e > \epsilon$. Such an estimate has all of the properties needed by
Learn', but only requires a sample of size $O(p^2\log(1/\delta)/\epsilon)$ as can be derived using Chernoff
bounds [22].

Thus, we arrive at the recurrence

$$M(\epsilon,\delta) \le \frac{12}{\alpha} \cdot M(g^{-1}(\epsilon), \tfrac15\delta) + O\left(\frac{p^2\log(1/\delta)}{\epsilon}\right)$$

which implies the stated bound. ■

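Lemma 4.3 The expected execution time of Learn'($\epsilon$, $\delta$, EX) is given by

$$T(\epsilon,\delta) = O\left(3^B \cdot t(\delta/5^B) + \frac{108^B \cdot u(\delta/5^B)}{\epsilon} \cdot \left(Bp^2 + p^2\log(1/\delta) + m(\delta/5^B)\right)\right).$$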

Proof:  This  bound  follows  from  the  recurrence  (8),  using  the  superior  bound  on  M  given 
by  Lemma  4.2.  ■ 

5  Variations  on  the  Learning  Model 

Next, we consider how the main result relates to some other learning models.

5.1 Group Learning

An immediate consequence of Theorem 3.1 concerns group learnability. In the group learning
model,  the  learner  produces  a  hypothesis  that  need  only  correctly  classify  large  groups  of 
instances,  all  of  which  are  either  positive  or  negative  examples.  Kearns  and  Valiant  [14,  17] 
prove  the  equivalence  of  group  learning  and  weak  learning.  Thus,  by  Theorem  3.1,  group 
learning  is  also  equivalent  to  strong  learning. 

5.2  Miscellaneous  PAC  Models 

Haussler  et  al.  [10]  describe  numerous  variations  on  the  basic  PAC  model,  and  show  that  all  of 
them  are  equivalent.  For  instance,  they  consider  randomized  versus  deterministic  algorithms, 
algorithms  for  which  the  size  s  of  the  target  concept  is  known  or  unknown,  and  so  on.  It 
is  not  hard  to  see  that  all  of  their  equivalence  proofs  apply  to  weak  learning  algorithms  as 
well  (with  one  exception  described  below),  and  so  that  any  of  these  weak  learning  models 
are  equivalent  by  Theorem  3.1  to  the  basic  PAC-learning  model. 

The  one  reduction  from  their  paper  that  does  not  hold  for  weak  learning  algorithms 
concerns  the  equivalence  of  the  one-  and  two-oracle  learning  models.  In  the  one-oracle 
model  (used  exclusively  in  this  paper),  the  learner  has  access  to  a  single  source  of  positive 
and  negative  examples.  In  the  two-oracle  model,  the  learner  has  access  to  one  oracle  that 
returns  only  positive  examples,  and  another  returning  only  negative  examples.  The  authors 
show  that  these  models  are  equivalent  for  strong  learning  algorithms.  However,  their  proof 
apparently cannot be adapted to show that one-oracle weak learnability implies two-oracle
weak learnability (although their proof of the converse is easily and validly adapted). This
is because their proof assumes that the error $\epsilon$ can be made arbitrarily small, clearly a bad
assumption for weak learning algorithms. Nevertheless, this is not a problem since we have
shown that one-oracle weak learnability implies one-oracle strong learnability, which in turn
implies two-oracle strong (and therefore weak) learnability. Thus, despite the inapplicability
of  Haussler  et  al.’s  original  proof,  all  four  learning  models  are  equivalent. 

5.3  Fixed  Hypotheses 

Much  of  the  PAC-learning  research  has  been  concerned  with  the  form  or  representation  of 
the hypotheses output by the learning algorithm. Clearly, the construction described in
Section 3 does not in general preserve the form of the hypotheses used by the weak learning
algorithm. It is natural to ask whether there exists any construction preserving this form.
That is, if concept class $C$ is weakly learnable by an algorithm using hypotheses from a class
$\mathcal{H}$ of representations, does there then exist a strong learning algorithm for $C$ that also only
outputs hypotheses from $\mathcal{H}$?

In  general,  the  answer  to  this  question  is  “no”  (modulo  some  relatively  weak  complexity 
assumptions). As a simple example, consider the problem of learning $k$-term DNF formulas
using only hypotheses represented by $k$-term DNF. (A formula in disjunctive normal form
(DNF) is one written as a disjunction of terms, each of which is a conjunction of literals,
a literal being either a variable or its complement.) Pitt and Valiant [19] show that this
learning problem is infeasible if RP $\ne$ NP.

Nevertheless,  the  weak  learning  problem  is  solved  by  the  algorithm  sketched  below.  (A 
similar  algorithm  is  given  by  Kearns  [15].)  First,  choose  a  “large”  sample.  If  significantly 
more  than  half  of  the  examples  in  the  sample  are  negative  (positive),  then  output  the  “always 
predict negative (positive)” hypothesis, and halt. Otherwise, we can assume that the
distribution is roughly evenly split between positive and negative examples. Select and output
the disjunction of $k$ or fewer literals that misclassifies none of the positive examples, and the
fewest of the negative examples. Working through the details, it is not hard to show that
this output formula is correct for nearly all of the positive examples and for at least $\Omega(1/n^k)$
of the negative examples. Since the distribution is roughly evenly divided between positive
and negative examples, this implies that the output hypothesis is roughly $\left(\frac12 - \Omega(1/n^k)\right)$-close
to the target formula.
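The weak learner sketched above can be written out by brute force over all disjunctions of at most $k$ literals. The following is a minimal illustration, not the algorithm of [15]: the representation of literals as (index, bit) pairs, the function names, and the threshold `skew` standing in for “significantly more than half” are all assumptions of this sketch.

```python
from itertools import combinations


def satisfies(clause, x):
    """clause: tuple of (index, wanted_bit) literals; true if any literal holds."""
    return any(x[i] == bit for i, bit in clause)


def weak_learn_kterm_dnf(sample, n, k, skew=0.6):
    """sample: list of (x, label), x a tuple of n bits, label a bool.

    Returns a predicate on instances. `skew` is a hypothetical threshold;
    the text does not fix a value for "significantly more than half".
    """
    pos = [x for x, label in sample if label]
    neg = [x for x, label in sample if not label]
    if len(neg) > skew * len(sample):
        return lambda x: False  # "always predict negative"
    if len(pos) > skew * len(sample):
        return lambda x: True   # "always predict positive"

    literals = [(i, bit) for i in range(n) for bit in (0, 1)]
    best, best_errors = None, None
    for size in range(1, k + 1):
        for clause in combinations(literals, size):
            if not all(satisfies(clause, x) for x in pos):
                continue  # must misclassify none of the positive examples
            errors = sum(satisfies(clause, x) for x in neg)
            if best is None or errors < best_errors:
                best, best_errors = clause, errors
    if best is None:
        return lambda x: True  # fallback if no clause covers the positives
    return lambda x, clause=best: satisfies(clause, x)
```

The search examines $O((2n)^k)$ clauses, which is polynomial in $n$ for fixed $k$. The reason a good clause exists is that any $k$-term DNF implies every disjunction formed by picking one literal from each term.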

5.4  Queries 

A  number  of  researchers  have  considered  learning  scenarios  in  which  the  learner  is  not  only 
able  to  passively  observe  randomly  selected  examples,  but  is  also  able  to  ask  a  “teacher” 
various  sorts  of  questions  or  queries  about  the  target  concept.  For  instance,  the  learner  might 
be  allowed  to  ask  if  some  particular  instance  is  a  positive  or  negative  example.  Angluin  [2] 
describes  several  kinds  of  query  that  might  be  useful  to  the  learner.  The  purpose  of  this 
section  is  simply  to  point  out  that  the  construction  of  Section  3  is  applicable  even  in  the 
presence of most kinds of query. That is, a weak learning algorithm that depends on the
availability of certain kinds of query can be converted, using the same construction, into a
strong learning algorithm using the same query types.


5.5  Many-Valued  Concepts 

In  this  paper,  we  have  only  considered  Boolean  valued  concepts,  i.e.,  concepts  that  classify 
every  instance  as  either  a  positive  or  a  negative  example.  Of  course,  in  the  “real  world,” 
most  learning  tasks  require  classification  into  one  of  several  categories  (for  instance,  character 
recognition). How does the result generalize to handle many-valued concepts?

First of all, for learning a $k$-valued concept, it is not immediately clear how to define
the notion of weak learnability. An hypothesis that guesses randomly on every instance
will be correct only $1/k$ of the time, so one natural definition would require only that the
weak learning algorithm classify instances correctly slightly more than $1/k$ of the time.
Unfortunately,  under  this  definition,  strong  and  weak  learnability  are  inequivalent  for  k  as 
small  as  three.  As  an  informal  example,  consider  learning  a  concept  taking  the  values  0,  1  and 
2,  and  suppose  that  it  is  “easy”  to  predict  when  the  concept  has  the  value  2,  but  “hard”  to 
predict  whether  the  concept’s  value  is  0  or  1.  Then  to  weakly  learn  such  a  concept,  it  suffices 
to  find  an  hypothesis  that  is  correct  whenever  the  concept  is  2,  and  that  guesses  randomly 
otherwise. For any distribution, this hypothesis will be correct at least half of the time, achieving
the weak learning criterion of accuracy significantly better than $\frac13$. However, boosting the
accuracy  further  is  clearly  infeasible. 

Thus,  a  better  definition  of  weak  learnability  is  one  requiring  that  the  hypothesis  be 
correct  on  slightly  more  than  half  of  the  distribution,  regardless  of  k.  Using  this  definition, 
the  construction  of  Section  3  is  easily  modified  to  handle  many-valued  concepts. 

6  General  Complexity  Bounds  for  PAC  Learning 

The  construction  derived  in  Sections  3  and  4  yields  some  unexpected  relationships  between 
the allowed error $\epsilon$ and various complexity measures that might be applied to a strong
learning algorithm. One of the more surprising of these is a proof that, for every learnable
concept class, there exists an efficient algorithm whose output hypotheses can be evaluated
in time polynomial in $\log(1/\epsilon)$. Furthermore, such an algorithm’s space requirements are
also only poly-logarithmic in $1/\epsilon$, far less, for instance, than would be needed to store the
entire sample. In addition, its time and sample size requirements grow only linearly in $1/\epsilon$
(disregarding log factors).


Theorem 6.1 If $C$ is a learnable concept class, then there exists an efficient learning
algorithm for $C$ that:

• requires a sample of size $p_1(n, s, \log(1/\epsilon), \log(1/\delta))/\epsilon$,

• runs in time $p_2(n, s, \log(1/\epsilon), \log(1/\delta))/\epsilon$,

• uses space $p_3(n, s, \log(1/\epsilon), \log(1/\delta))$, and

• outputs hypotheses of size $p_4(n, s, \log(1/\epsilon), \log(1/\delta))$

for some polynomials $p_1$, $p_2$, $p_3$ and $p_4$.


Proof: Given a strong learning algorithm $A$ for $C$, convert $A$ into a weak learning algorithm
$A'$ that outputs hypotheses $\frac14$-close to the target concept. Now let $A''$ be the procedure
obtained by applying the construction of Learn' with $A'$ plugged in for WeakLearn.
Furthermore, assume without loss of generality, using the results of Haussler et al. [10], that $A''$
halts deterministically in time polynomial in $\log(1/\delta)$. Then the lemmas of Sections 3 and 4
imply that $A''$ has all of the stated properties. ■

The  remainder  of  this  section  is  a  discussion  of  some  of  the  consequences  of  Theorem  6.1. 


6.1  Improving  the  Performance  of  Existing  Algorithms 

These  bounds  can  be  applied  immediately  to  a  number  of  existing  learning  algorithms, 
yielding improvements in time and/or space complexity (at least in terms of $\epsilon$). For instance,
the computation time of Blumer et al.’s [4] algorithm for learning half-spaces of $\mathbb{R}^n$, which
involves the solution of a linear programming problem of size proportional to the sample, can
be improved by a polynomial factor of $1/\epsilon$. The same is also true of Baum’s [3] algorithm for
learning  unions  of  half-spaces,  which  involves  finding  the  convex  hull  of  a  significant  fraction 
of  the  sample. 

There  are  many  more  algorithms  for  which  the  theorem  implies  improved  space  efficiency. 
This is especially true of the many known PAC algorithms that work by choosing a large
sample and then finding an hypothesis consistent with it. For instance, this is how Rivest’s [20]
decision list algorithm works, as do most of the algorithms described by Blumer et al. [4], as
well as Helmbold, Sloan and Warmuth’s [13] construction for learning nested differences of
learnable  concepts.  Since  the  entire  sample  must  be  stored,  these  algorithms  are  not  terribly 
space  efficient,  and  so  can  be  dramatically  improved  by  applying  Theorem  6.1.  Of  course, 
these  improvements  typically  come  at  the  cost  of  requiring  a  somewhat  larger  sample  (by  a 
polynomial factor of $\log(1/\epsilon)$). Thus, there appears to be a trade-off between sample size
and  space  (or  time)  complexity. 

6.2  Data  Compression 

Blumer  et  al.  [5,  4]  have  considered  the  relationship  between  learning  and  data  compression. 
They  have  shown  that,  if  any  sample  can  be  “compressed”  —  i.e.,  represented  by  a  prediction 
rule  significantly  smaller  than  the  original  sample  —  then  this  compression  algorithm  can 
be  converted  into  a  PAC-learning  algorithm. 

In  some  sense,  the  bound  given  in  Theorem  6.1  on  the  size  of  the  output  hypothesis 
implies the converse. In particular, suppose $C_n$ is a learnable concept class and that we have
been given $m$ examples $(v_1, c(v_1)), (v_2, c(v_2)), \ldots, (v_m, c(v_m))$ where each $v_i \in X_n$ and $c$ is a
concept in $C_n$ of size $s$. The data compression problem is to find a small representation for
the data, i.e., an hypothesis $h$ that is significantly smaller than the original data set with
the property that $h(v_i) = c(v_i)$ for each $v_i$. An hypothesis with this last property is said to
be  consistent  with  the  sample. 

Theorem 6.1 implies the existence of an efficient algorithm that outputs consistent
hypotheses of size only poly-logarithmic in the size $m$ of the sample. This is proved by the following
theorem: 

Theorem 6.2 Let $C$ be a learnable concept class. Then there exists an efficient algorithm
that, given $0 < \delta \le 1$ and $m$ examples of a concept $c \in C_n$ of size $s$, outputs with probability
at least $1 - \delta$ a deterministic hypothesis consistent with the sample of size polynomial in $n$,
$s$ and $\log m$.

Proof: Haussler et al. [10] show how to convert any learning algorithm into one that finds
hypotheses consistent with a set of data points. The idea is to choose $\epsilon < 1/m$ and to run
the learning algorithm on a (simulated) uniform distribution over the data set. Since $\epsilon$ is
less than the weight placed on any element of the sample, the output hypothesis must have
error zero. Applying this technique to a learning algorithm $A$ satisfying the conditions of
Theorem 6.1, we see that the output hypothesis has size only polynomial in $n$, $s$ and $\log m$,
and  so  is  far  smaller  than  the  original  sample  for  large  m. 

Technically, this technique requires that the learning algorithm output deterministic
hypotheses. However, probabilistic hypotheses can also be handled by choosing a somewhat
smaller value for ε, and by “hard-wiring” the computed probabilistic hypothesis with a
sequence of random bits. More precisely, set ε = 1/(2m^2), and run A over the same distribution
as before. Then with high probability, the computed hypothesis h has chance of error at
most 1/(2m) on any one of the m examples in the sample. Now note that h can be regarded
as a deterministic function of an instance v and a sequence of random bits r. If r is such
a sequence, then, summing the chance of error over the m examples, the chance that h(·, r)
correctly classifies all of the m examples is at least 1/2. Thus, choosing and testing random
sequences r, we can quickly find one for which the deterministic hypothesis h(·, r) is
consistent with the sample. Finally, note that the size of this output hard-wired hypothesis
is bounded by |h| + |r|, and that |r| is bounded by the time it takes to evaluate h, which is
poly-logarithmic in m. ■
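
The hard-wiring step can be sketched as follows. The names `hard_wire` and `noisy_threshold`,
the 64-bit random string and the retry limit are assumptions made for illustration;
`noisy_threshold` is a toy probabilistic hypothesis, not one produced by the construction of
this paper.

```python
import random


def hard_wire(h, sample, rng, max_tries=1000, bits=64):
    """Search for a random string r making h(., r) consistent with `sample`.

    `h(v, r)` is a probabilistic hypothesis viewed as a deterministic
    function of an instance v and a random string r. Returns the
    hard-wired deterministic hypothesis together with the chosen r.
    """
    for _ in range(max_tries):
        r = rng.getrandbits(bits)
        if all(h(v, r) == label for v, label in sample):
            return (lambda v: h(v, r)), r
    raise RuntimeError("no consistent random string found")


def noisy_threshold(v, r):
    """Toy hypothesis for 'v >= 4' that errs on v when a 4-bit field of r is zero."""
    flip = ((r >> (4 * (v % 16))) & 0xF) == 0
    return (v >= 4) != flip


sample = [(v, v >= 4) for v in [1, 3, 4, 6, 9]]
h_fixed, r = hard_wire(noisy_threshold, sample, random.Random(1))
print(all(h_fixed(v) == label for v, label in sample))
```

When each trial succeeds with probability at least 1/2, as in the proof, the expected number
of random strings tested is at most two.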

Naturally,  the  notion  of  size  in  the  preceding  theorem  depends  on  the  underlying  model  of 
computation,  which  has  deliberately  been  left  unspecified.  However,  the  theorem  has  some 
immediate corollaries when the learning problem is discrete, i.e., when every instance in the
domain X_n is encoded using a finite alphabet by a string of length presumably bounded by
a  polynomial  in  n. 

Corollary 6.1 Let C be a learnable discrete concept class. Then there exists an efficient
algorithm that, given 0 < δ < 1 and a sample as in Theorem 6.2, outputs a deterministic
consistent hypothesis of size polynomial in n and s, and independent of m.

Proof: The size m of the sample is clearly bounded by |X_n|. Since log |X_n| is bounded by
a polynomial in n, the corollary follows immediately. ■

Applying  “Occam’s  Razor”  of  Blumer  et  al.  [5],  this  implies  the  following  strong  general 
bound  on  the  sample  size  needed  to  efficiently  learn  C.  Although  the  bound  is  better  than 
that given by Theorem 6.1 (at least in terms of ε), it should be pointed out that this
improvement  requires  the  sacrifice  of  space  efficiency  since  the  entire  sample  must  be  stored. 


Theorem 6.3 Let C be a learnable discrete concept class. Then there exists an efficient
learning algorithm for C requiring a sample of size O((1/ε) · (log(1/δ) + p(n, s))) for some
polynomial p.
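
The kind of calculation behind an Occam-style bound can be illustrated numerically. The sketch
assumes the standard bound for a finite hypothesis class H, namely that
m >= (1/ε)(ln |H| + ln(1/δ)) examples suffice for any consistent hypothesis from H to have error
at most ε with probability at least 1 - δ. It further assumes hypotheses encoded in exactly b
bits, so that |H| <= 2^b. The function name and the numbers are illustrative and are not taken
from Blumer et al. [5].

```python
import math


def occam_sample_size(epsilon, delta, hypothesis_bits):
    """Sample size sufficient for a consistent hypothesis of `hypothesis_bits` bits.

    Uses m >= (ln|H| + ln(1/delta)) / epsilon with |H| <= 2**hypothesis_bits.
    """
    log_class_size = hypothesis_bits * math.log(2)
    return math.ceil((log_class_size + math.log(1.0 / delta)) / epsilon)


# With 100-bit hypotheses, the requirement grows linearly in 1/epsilon.
for eps in (0.1, 0.05, 0.01):
    print(eps, occam_sample_size(eps, 0.05, 100))
```

Because the hypothesis size here does not depend on m, the dependence on ε is linear in 1/ε,
with no additional logarithmic factor.
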

6.3 Hard Functions are Hard to Learn


Theorem 6.1’s bound on the size of the output hypothesis also implies that any
hard-to-evaluate concept class is unlearnable. Although this result does not sound surprising,
it was previously unclear how it might be proved: since a learning algorithm’s hypotheses are
technically permitted to grow polynomially in 1/ε, the learnability of such classes did not
seem out of the question.

This result yields the first representation-independent hardness results not based on
cryptographic assumptions. For instance, assuming P/poly ≠ NP/poly, the class of
polynomial-size, nondeterministic Boolean circuits is not learnable. (The set P/poly (NP/poly)
consists of those languages accepted by a family of polynomial-size deterministic
(nondeterministic) circuits.) Furthermore, since learning pattern languages was recently
shown [21] to be as hard as learning NP/poly, this result shows that pattern languages are
also unlearnable under this relatively weak structural assumption.

Theorem 6.4 Suppose C is learnable, and assume that X_n = {0,1}^n. Then there exists a
polynomial p such that for all concepts c ∈ C_n of size s, there exists a circuit of size p(n, s)
exactly computing c.

Proof: Consider the set of 2^n pairs {(v, c(v)) | v ∈ X_n}. By Corollary 6.1, there exists
an algorithm that, with positive probability, will output an hypothesis consistent with this
set of elements of size only polynomial in n and s. Since this hypothesis is polynomially
evaluatable, it can be converted using standard techniques into a circuit of the required size. ■


6.4  Hard  Functions  are  Hard  to  Approximate 

By a similar argument, the bound on hypothesis size implies that any function not
computable by small circuits cannot even be weakly approximated by a family of small circuits,
for some distribution on the inputs.

Let f be a Boolean function on {0,1}*, D a distribution on {0,1}^n and C a circuit on n
variables. Then C β-approximates f under D if the probability is at most β that C(v) ≠ f(v)
on an assignment v chosen randomly from {0,1}^n according to D.

Theorem 6.5 Suppose some function f cannot be computed by any family of polynomial-size
circuits. Then there exists a family of distributions D_1, D_2, ..., where D_n is over the set
{0,1}^n, such that for all polynomials p and q, there exist infinitely many n for which there
exists no n-variable circuit of size at most q(n) that (1/2 - 1/p(n))-approximates f under D_n.
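
The notion of β-approximation used in Theorem 6.5 can be made concrete for small n by a direct
computation. In this sketch a "circuit" is represented simply as a Python function on bit
tuples, and a distribution as a dictionary of weights; both representations, and the names
`disagreement` and `beta_approximates`, are assumptions made for illustration.

```python
from itertools import product


def disagreement(circuit, f, dist):
    """Probability under `dist` that circuit(v) != f(v).

    `dist` maps each assignment v (a tuple of bits) to its probability.
    """
    return sum(weight for v, weight in dist.items() if circuit(v) != f(v))


def beta_approximates(circuit, f, dist, beta):
    """True if `circuit` beta-approximates `f` under `dist`."""
    return disagreement(circuit, f, dist) <= beta


n = 2
uniform = {v: 1.0 / 2 ** n for v in product((0, 1), repeat=n)}
f_and = lambda v: v[0] & v[1]          # target function: AND of two bits
always_zero = lambda v: 0              # trivial circuit

print(disagreement(always_zero, f_and, uniform))
```

Under the uniform distribution on two bits, the constant-zero circuit disagrees with AND only
on the assignment (1, 1), so it β-approximates AND exactly when β is at least 1/4.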

Proof: Throughout this proof, we will assume without loss of generality that p(n) = q(n) =
n^k for some integer k ≥ 1.

Suppose first that for some k there exists for every distribution D on {0,1}^n a circuit of
size at most n^k that (1/2 - 1/n^k)-approximates f under D. Then f can, in a sense, be weakly
learned. More precisely, there exists an (exponential time) procedure that, by searching
exhaustively the set of all circuits of size n^k, will find one that (1/2 - 1/n^k)-approximates f
under some given distribution D. Therefore, by Theorem 3.1, f is strongly learnable in a
similar sense in exponential time. Applying Theorem 6.4 (whose validity depends only on
the size of the output hypothesis, and not on the running time), this implies that f can
be exactly computed by a family of polynomial-size circuits, contradicting the theorem’s
hypothesis.

Thus, for all k ≥ 1, there exists an integer n and a distribution D on {0,1}^n such that
no circuit of size at most n^k is able to (1/2 - 1/n^k)-approximate f under D. To complete the
proof, it suffices to show that this implies the theorem’s conclusion.

Let 𝒟_k^n be the set of distributions D on {0,1}^n for which no circuit of size at most n^k
(1/2 - 1/n^k)-approximates f under D. It is easy to verify that 𝒟_k^n ⊇ 𝒟_{k+1}^n for all k, n.
Also, since every function can be computed by exponential size circuits, there must exist a
constant c > 0 for which 𝒟_{cn}^n = ∅ for all n. Let n[k] be the smallest n for which 𝒟_k^n ≠ ∅.
By the preceding argument, n[k] must exist. Furthermore, n[k] > k/c, which implies that the set
N = {n[k] | k ≥ 1} cannot have finite cardinality.

To eliminate repeated elements from N, let k_1 < k_2 < ··· be such that n[k_i] ≠ n[k_j] for
i ≠ j, and such that {n[k_j] | j ≥ 1} = N. Let D_i be defined as follows: if i = n[k_j] for
some j, then let D_i be any distribution in 𝒟_{k_j}^i (which cannot be empty by our definition
of n[k]); otherwise, if i ∉ N, then define D_i arbitrarily. Then D_1, D_2, ... is the desired
family of “hard” distributions. For if k is any integer, then for all k_i ≥ k,
D_{n[k_i]} ∈ 𝒟_{k_i}^{n[k_i]} ⊆ 𝒟_k^{n[k_i]}. This proves the theorem. ■

Informally, Theorem 6.5 states that any language not in the complexity class P/poly
cannot be even weakly approximated by any other language in P/poly under some “hard”
family of distributions. In fact, the theorem can easily be modified to apply to other circuit
classes as well, including monotone P/poly, and monotone or non-monotone NC^k for fixed
k. (The class NC^k consists of all languages accepted by polynomial-size circuits of depth
at most O(log^k n), and a monotone circuit is one in which no negated variables appear.)
In general, the theorem applies to all circuit classes closed under the transformation on
hypotheses resulting from the construction of Sections 3 and 4.

6.5  On-Line  Learning 

Finally, we consider implications of Theorem 6.1 for on-line learning algorithms. In the
on-line learning model, the learner is presented one (randomly selected) instance at a time in
a series of trials. Before being told its correct classification, the learner must try to predict
whether the instance is a positive or negative example. An incorrect prediction is called a
mistake. In this model, the learner’s goal is to minimize the number of mistakes.

Previously,  Haussler,  Littlestone  and  Warmuth  [11]  have  shown  that  a  concept  class  C  is 
learnable  if  and  only  if  there  exists  an  on-line  learning  algorithm  for  C  with  the  properties 
that: 

• the probability of a mistake on the mth trial is at worst linear in m^(-β) for some constant
0 < β < 1, and (equivalently)

• the expected number of mistakes on the first m trials is at worst linear in m^α for some
constant 0 < α < 1.

(This  result  is  also  described  in  their  paper  with  Kearns  [10].)  Noting  several  examples 
of  learning  algorithms  for  which  this  second  bound  only  grows  poly-logarithmically  in  m, 
the  authors  ask  if  every  learnable  concept  class  has  an  algorithm  attaining  such  a  bound. 
Theorem  6.6  below  answers  this  open  question  affirmatively,  showing  that  in  general  the 
expected number of mistakes on the first m trials need only grow as a polynomial in log m.
Thus,  we  expect  only  a  minute  fraction  of  the  first  m  predictions  to  be  incorrect. 

(This result should not be confused with those presented in another paper by Haussler,
Littlestone and Warmuth [12]. In that paper, the authors describe a general algorithm
applicable to a wide collection of concept classes, and they show that the expected number
of mistakes made by this algorithm on the first m trials is linear in log m. However, their
algorithm requires exponential computation time, even if it is known that the concept class
is learnable. In contrast, Theorem 6.6 states that, if a concept class is learnable, then there
exists an efficient algorithm whose expected number of mistakes on the first m trials is
poly-logarithmic in m.)

Haussler, Littlestone and Warmuth [11] also consider the space efficiency of on-line
learning algorithms. They define a space-efficient learning algorithm to be one whose space
requirements on the first m trials do not exceed a polynomial in n, s and log m. Thus, a
space-efficient algorithm is one using far less memory than would be required to store
explicitly all of the preceding observations. The authors describe a number of space-efficient
algorithms (though they are unable to find one for learning unions of axis-parallel rectangles
in the plane), and so are led to ask whether there exist space-efficient algorithms for all
learnable concept classes. Surprisingly, this open question can also be answered affirmatively,
as proved by the theorem below.

Lastly, Theorem 6.6 gives a bound on the computational complexity of on-line learning (in
terms of ε). In particular, the total computation time required to process the first m examples
is only proportional to m log^c m, for some constant c. Thus, in a sense, the “amortized” or
“average” computation time on the mth trial is only poly-logarithmic in m. (In fact, a more
careful analysis would show that this is also true of the worst case computation time on the
mth trial.)


Theorem 6.6 Let C be a learnable concept class. Then there exists an efficient on-line
learning algorithm for C with the properties that:

• the probability of a mistake on the mth trial is at most p_1(n, s, log m) / m,

• the expected number of mistakes on the first m trials is at most p_2(n, s, log m),

• the total computation time required on the first m trials is at most m · p_3(n, s, log m),
and

• the space used on the first m trials is at most p_4(n, s, log m),

for some polynomials p_1, p_2, p_3, p_4.

Proof: Since C is learnable, there exists an efficient (batch) algorithm satisfying the
properties of Theorem 6.1. Let A be such an algorithm, but with ε/2 substituted for both ε
and δ. Then the chance that A’s output hypothesis incorrectly classifies a randomly chosen
instance is at most ε. (This technique is also used by Haussler et al. [10].)

Fix n and s, and let m(ε) be the number of examples needed by A. From Theorem 6.1,
m(ε) ≤ (p/ε) · lg^c(1/ε) for some constant c and some value p implicitly bounded by a
polynomial in n and s. Let ε(m) = (p/m) · lg^c(m/p). Then it can be verified that m(ε(m)) ≤ m for
m ≥ 2p. Thus, for sufficiently large m, ε(m) gives a bound on the best error rate achievable
from a sample of size m.
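
The relationship between m(ε) and ε(m) can be checked numerically for particular values. The
sketch below treats the upper bound on m(ε) as if it were m(ε) itself, and the values p = 10
and c = 2 are arbitrary choices made for illustration; the check says nothing about other
values of p and c.

```python
import math


def examples_needed(eps, p, c):
    """The bound (p / eps) * lg^c(1 / eps) on the number of examples m(eps)."""
    return (p / eps) * math.log2(1.0 / eps) ** c


def error_from_sample(m, p, c):
    """The error rate eps(m) = (p / m) * lg^c(m / p)."""
    return (p / m) * math.log2(m / p) ** c


p, c = 10, 2
for m in (2 * p, 100, 1000, 10000):
    eps = error_from_sample(m, p, c)
    print(m, eps, examples_needed(eps, p, c))
```

For small m the value ε(m) can exceed 1, in which case it is vacuous as an error bound; it
becomes informative only once m is large relative to p.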

To convert A into an on-line learning algorithm in a manner that preserves time and
space efficiency, imagine breaking the sequence of trials into blocks of increasing size: the
first block consists of the first 2p trials, and each new block has twice the size of the last.
Thus, in general, the ith block has size s_i = 2^i p, and consists of trials
a_i = 2(2^(i-1) - 1)p + 1 through b_i = 2(2^i - 1)p.
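
The block boundaries can be computed directly from these formulas. The function names below
are introduced for illustration only.

```python
def block_bounds(i, p):
    """Return (a_i, b_i, s_i): first trial, last trial and size of block i >= 1."""
    first = 2 * (2 ** (i - 1) - 1) * p + 1
    last = 2 * (2 ** i - 1) * p
    size = 2 ** i * p
    return first, last, size


def block_of_trial(m, p):
    """Return the index i of the block containing trial m >= 1."""
    i = 1
    while m > block_bounds(i, p)[1]:
        i += 1
    return i


for i in range(1, 5):
    print(i, block_bounds(i, 3))
```

Each block begins immediately after the previous one ends, and the number of blocks needed to
cover the first m trials grows only logarithmically in m/p.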

On the trials of the ith block, algorithm A is simulated to compute the ith hypothesis
h_i. Specifically, A is simulated with ε set to ε(s_i), which thus bounds the probability that
h_i misclassifies a new instance. (Note that there are enough instances available in this block
for A to compute an hypothesis of the desired accuracy.) On the next block, as the (i + 1)st
hypothesis is being computed, h_i is used to make predictions; at the end of this block, h_i is
discarded as h_{i+1} takes its place.

Thus, if the mth trial occurs in the ith block (i.e., if a_i ≤ m ≤ b_i), then the probability
of a mistake is bounded by ε(s_{i-1}), the error rate of h_{i-1}. From the definition of ε(·), this
implies the desired bound on the probability of a mistake on the mth trial, and, in turn, on
the expected number of mistakes on the first m trials.
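
The arithmetic behind this step can be spelled out as follows. The derivation is a
reconstruction from the definitions of the block sizes, the block boundaries and ε(·) given in
this proof; it applies to blocks i ≥ 2, and the constants are not claimed to be the ones
intended by the author.

```latex
\begin{align*}
\epsilon(s_{i-1}) &= \frac{p}{2^{i-1}p}\,\lg^{c}\!\left(\frac{2^{i-1}p}{p}\right)
                   = \frac{(i-1)^{c}}{2^{i-1}}, \\
m \le b_i < 2^{i+1}p &\;\Longrightarrow\; \frac{1}{2^{i-1}} < \frac{4p}{m}, \\
m \ge a_i \ge 2^{i-1}p &\;\Longrightarrow\; i-1 \le \lg\frac{m}{p}, \\
\Pr[\text{mistake on trial } m] &\le \epsilon(s_{i-1})
                   \le \frac{4p}{m}\,\lg^{c}\frac{m}{p}, \\
\mathrm{E}[\text{mistakes in block } i] &\le s_i\,\epsilon(s_{i-1}) = 2p\,(i-1)^{c}.
\end{align*}
```

The first m trials touch at most lg(m/p) + 1 blocks, and the first block contributes at most
2p mistakes, so summing the last line over the blocks gives an expected number of mistakes of
at most 2p(1 + lg^(c+1)(m/p)), which is polynomial in p and log m.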

Finally, note that on the ith block, space is needed only to store the hypothesis from the
last block h_{i-1}, and to simulate A’s computation of block i’s hypothesis. By Theorem 6.1,
both of these quantities grow polynomially in log(1/ε). By our choice of ε, this implies the
desired bound on the algorithm’s space efficiency. The time complexity of the procedure is
bounded in a similar fashion. ■
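
The block-doubling conversion can be sketched as follows. This is a simplified illustration
under stated assumptions: `MinPositiveThreshold` is a hypothetical toy incremental learner for
concepts of the form x >= t that stands in for A, the accuracy parameter passed to A in the
proof is left implicit, and all class and function names are introduced here.

```python
import math


class MinPositiveThreshold:
    """Toy incremental learner for concepts 'x >= t' (hypothetical stand-in for A)."""

    def __init__(self):
        self.threshold = math.inf

    def observe(self, x, label):
        if label:
            self.threshold = min(self.threshold, x)

    def hypothesis(self):
        t = self.threshold            # snapshot, so later updates do not leak in
        return lambda x: x >= t


class BlockwiseOnlineLearner:
    """Predict with the previous block's hypothesis while learning the next one."""

    def __init__(self, make_learner, p):
        self.make_learner = make_learner
        self.p = p
        self.block = 1
        self.seen_in_block = 0
        self.in_progress = make_learner()
        self.current = lambda x: False    # arbitrary predictions during block 1

    def predict(self, x):
        return self.current(x)

    def update(self, x, label):
        self.in_progress.observe(x, label)
        self.seen_in_block += 1
        if self.seen_in_block == 2 ** self.block * self.p:   # block i has size 2^i p
            self.current = self.in_progress.hypothesis()     # h_i replaces h_{i-1}
            self.block += 1
            self.seen_in_block = 0
            self.in_progress = self.make_learner()


def count_mistakes(learner, stream):
    """Run the on-line protocol over `stream` and count incorrect predictions."""
    mistakes = 0
    for x, label in stream:
        if learner.predict(x) != label:
            mistakes += 1
        learner.update(x, label)
    return mistakes


stream = [(x, x >= 5) for x in [7, 2, 4, 5, 9, 6, 1, 8, 5, 3, 10]]
print(count_mistakes(BlockwiseOnlineLearner(MinPositiveThreshold, 1), stream))
```

At any time the sketch stores only the current hypothesis and the state of the learner for
the block in progress, mirroring the space argument of the proof; for the toy learner that
state is a single threshold.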

7  Conclusions  and  Open  Problems 

We have shown that a model of learnability in which the learner is only required to perform
slightly better than guessing is as strong as a model in which the learner’s error can be made
arbitrarily small. The proof of this result was based on the filtering of the distribution in a
manner causing the weak learning algorithm to eventually learn nearly the entire distribution.
We have also shown that this proof implies a set of general bounds on the complexity of
PAC-learning (both batch and on-line), and have discussed some of the applications of these
bounds.

It is hoped that these results will open the way to a new method of algorithm design for
PAC-learning. As previously mentioned, the vast majority of currently known algorithms
work by finding an hypothesis consistent with a large sample. An alternative approach
suggested by the main result is to seek instead an hypothesis covering slightly more than half
the distribution. Perhaps such an hypothesis is easier to find, at least from the point of view
of the algorithm designer. This approach leads to algorithms with a flavor similar to the
one described for k-term DNF in Section 5.3, and it is possible to find similar algorithms
for a number of other concept classes that are already known to be learnable (for example,
k-decision lists [20] and rank r decision trees [7]). To what extent will this approach be
fruitful for other classes not presently known to be learnable? This is an open question.

Another  open  problem  concerns  the  robustness  of  the  construction  described  in  this 
paper.  Intuitively,  it  seems  that  there  should  be  a  close  relationship  between  reducing  the 
error  of  the  hypothesis,  and  overcoming  noise  in  the  data.  Is  this  a  valid  intuition?  Can  our 
construction  be  modified  to  handle  noise? 

Finally, turning away from the theoretical side of machine learning, we can ask how well
our construction would perform in practice. Often, a learning program (for instance, a neural
network) is designed, implemented, and found empirically to achieve a “good” error rate, but
no way is seen of improving the program further to enable it to achieve a “great” error rate.
Suppose our construction is implemented on top of this learning program. Would it help?
This is not a theoretical question, but one that can only be answered experimentally, and one
that obviously depends on the domain and the underlying learning program. Nevertheless,
it seems plausible that the construction might in some cases give good results in practice.